L16: Entry 1 of 1 File: USPT Apr 28, 1998



DOCUMENT-IDENTIFIER: US 5745727 A

TITLE: Linked caches memory for storing units of information 
Detailed Description Text (30) : 

A number of embodiments of the present invention have been described. Nevertheless, 
it will be understood that various modifications may be made without departing from 
the spirit and scope of the invention. For example, the linked caches of the 
present invention may be used in any system in which a cache is used to store first 
units of information associated with a second unit of information. For example, in 
a database in which addresses are stored in a cache, the street address may be 
stored in a first cache with an index into a second cache which stores the city, 
state, and zip code associated with the address. Accordingly, any device may be 
coupled to the linked caches and the coordination controller 110 to request that 
information be read from the linked caches. 
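
The address example can be sketched in Python, with two dictionaries standing in for the two caches. This is only an illustration of the linking idea, not the patented hardware; the names StreetEntry, LocalityEntry, and read_address, and the sample data, are assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class StreetEntry:
    street: str
    locality_index: int  # index into the second cache


@dataclass(frozen=True)
class LocalityEntry:
    city: str
    state: str
    zip_code: str


# Hypothetical cache contents; the keys are illustrative indices.
street_cache = {101: StreetEntry("12 Elm St", locality_index=7)}
locality_cache = {7: LocalityEntry("Springfield", "IL", "62701")}


def read_address(record_id: int) -> Optional[Tuple[StreetEntry, LocalityEntry]]:
    """Return both linked entries, or None if either cache misses."""
    street = street_cache.get(record_id)
    if street is None:
        return None
    # The index stored in the first entry selects the second entry.
    locality = locality_cache.get(street.locality_index)
    if locality is None:
        return None
    return street, locality
```

Many street entries can share one locality entry, so the city, state, and zip code are stored once rather than once per address.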
L15: Entry 1 of 1 File: USPT Apr 28, 1998



DOCUMENT-IDENTIFIER: US 5745727 A

TITLE: Linked caches memory for storing units of information 
Brief Summary Text (13) : 

If a requesting device requests a block of information, a first cache controller 
searches the first cache to determine whether the Exchange Context is present in 
the first cache. If the Exchange Context is not present in the first cache, then 
the first cache controller informs a coordination control logic device to request 
that a microcontroller read the Exchange Context from a main context memory array 
(i.e., a "context array"). The Exchange Context information read from the context 
array is stored in the first cache. In accordance with the present invention, the 
Port Context Index in the first cache is used to direct the second cache controller 
to associated Port Context information within the second cache. That is, the Port 
Context Index is communicated from the first cache to the second cache controller. 
The second cache controller then attempts to locate the Port Context information 
associated with the Exchange Context retrieved from the first cache. If the Port 
Context information is found, then both the Port Context information and the 
Exchange Context information are presented to the requesting device. 
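
The read sequence described above can be sketched as follows. This is a software model under stated assumptions, not the claimed circuit: the class LinkedCaches, its attribute names, and the use of a dictionary for the context array are hypothetical, and a miss in the second cache simply returns None because this passage does not describe that case.

```python
from dataclasses import dataclass


@dataclass
class ExchangeContext:
    exchange_id: int
    port_context_index: int  # link into the second cache


@dataclass
class PortContext:
    port_id: int


class LinkedCaches:
    def __init__(self, context_array, port_cache):
        self.context_array = context_array  # stands in for the main context memory
        self.exchange_cache = {}            # first cache
        self.port_cache = port_cache        # second cache
        self.array_reads = 0                # counts reads from the context array

    def read(self, exchange_index):
        exchange = self.exchange_cache.get(exchange_index)
        if exchange is None:
            # First-cache miss: read the Exchange Context from the context
            # array and store it in the first cache.
            exchange = self.context_array[exchange_index]
            self.array_reads += 1
            self.exchange_cache[exchange_index] = exchange
        # The Port Context Index from the first cache directs the second lookup.
        port = self.port_cache.get(exchange.port_context_index)
        if port is None:
            return None  # assumption: miss handling is outside this sketch
        # Both found: present both to the requesting device.
        return exchange, port
```

A second read of the same index is served from the first cache, so array_reads stays at 1.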

Detailed Description Text (7) : 

In accordance with one embodiment of the present invention, the first and second 
cache controllers 105, 111 are implemented as a single state machine. Operations of
the first and second cache controllers 105, 111 are preferably sequential. That is, only after
the first cache controller 105 finds a requested Exchange Context does the second 
cache controller 111 begin searching for the associated Port Context. Preferably, 
the coordination controller 110 is implemented as a second state machine. A Write 
Path Controller 115 is preferably implemented as a third state machine. 
Accordingly, the cache controllers 105, 111, the coordination controller 110, and 
the Write Path Controller 115 each are preferably independent devices within the 
communications adapter 100. 

CLAIMS : 

1. A linked cache memory for storing units of information, the units of information 
being a first and second subset of units of information stored in a related memory 
device, including: 

(a) a first cache device for storing the first subset of the information; 

(b) a second cache device for storing the second subset of the information; 

(c) a cache controller, coupled to the first and second cache devices for: 

(1) receiving from an external device a first index; 

(2) searching the first cache device for a first unit of information associated 
with the first index; 






(3) outputting a first indication that the first unit of information was found, if 
the first unit of information is presently stored within the first cache device; 

(4) receiving from the first cache device a second index embedded within the first 
unit of information; 

(5) searching the second cache device for a second unit of information associated 
with the second index; and

(6) outputting a second indication that the second unit of information was found, 
if the second unit of information is present within the second cache device; 

(d) a coordination controller, coupled to the cache controller, for receiving from 
the cache controller the first and second indications that the first and second 
cache devices have found the first and second units of information, and in response 
to such receipt of such indications, enabling outputting of the first and second 
units of information; and 

(e) a write path controller, coupled to the coordination controller and the first 
and second cache devices, for coupling input signals from one of a plurality of 
sources to inputs of the first or the second cache, and for indicating to the 
source of the input signal that input data represented by the input signal has been 
stored in a cache device. 

4. A linked cache memory for storing units of information, the units of information 
being a first and second subset of units of information stored in a related memory 
device, including: 

(a) a first cache device for storing the first subset of the information; 

(b) a second cache device for storing the second subset of the information; 

(c) a cache controller, coupled to the first and second cache devices for: 

(1) receiving from an external device a first index; 

(2) searching the first cache device for a first unit of information associated 
with the first index;

(3) outputting a first indication that the first unit of information was found, if 
the first unit of information is presently stored within the first cache device; 

(4) receiving from the first cache device a second index embedded within the first 
unit of information; 

(5) searching the second cache device for a second unit of information associated 
with the second index; and

(6) outputting a second indication that the second unit of information was found, 
if the second unit of information is present within the second cache device; 

(d) a coordination controller, coupled to the cache controller, for receiving from 
the cache controller the first and second indications that the first and second 
cache devices have found the first and second units of information, and in response 
to such receipt of such indications, enabling outputting of the first and second 
units of information, wherein the coordination controller is capable of performing 
a direct memory access operation to read information from the related memory 
device.

5. A linked cache memory for storing units of information, the units of information 
being a first and second subset of units of information stored in a related memory
device, including:

(a) a first cache device for storing the first subset of the information; 

(b) a second cache device for storing the second subset of the information; 

(c) a cache controller, coupled to the first and second cache devices for: 

(1) receiving from an external device a first index; 

(2) searching the first cache device for a first unit of information associated 
with the first index;

(3) outputting a first indication that the first unit of information was found, if 
the first unit of information is presently stored within the first cache device; 

(4) receiving from the first cache device a second index embedded within the first 
unit of information; 

(5) searching the second cache device for a second unit of information associated 
with the second index; and

(6) outputting a second indication that the second unit of information was found, 
if the second unit of information is present within the second cache device; 

(d) a coordination controller, coupled to the cache controller, for receiving from 
the cache controller the first and second indications that the first and second 
cache devices have found the first and second units of information, and in response 
to such receipt of such indications, enabling outputting of the first and second 
units of information; and 

(e) a lock means for locking the first and second cache devices to prevent a 
particular unit of information from being altered when that unit of information is 
in use; 

wherein the first unit of information is an exchange context and the second unit of 
information is a port context, the exchange context and the port context being 
associated with a frame of data which is either being transmitted or received, and 
wherein the lock means activates a receive lock bit when a port context or exchange 
context is being used in association with a received frame. 

6. A linked cache memory for storing units of information, the units of information 
being a first and second subset of units of information stored in a related memory 
device, including:

(a) a first cache device for storing the first subset of the information; 

(b) a second cache device for storing the second subset of the information; 

(c) a cache controller, coupled to the first and second cache devices for: 

(1) receiving from an external device a first index; 

(2) searching the first cache device for a first unit of information associated 
with the first index;

(3) outputting a first indication that the first unit of information was found, if 
the first unit of information is presently stored within the first cache device; 






(4) receiving from the first cache device a second index embedded within the first 
unit of information; 

(5) searching the second cache device for a second unit of information associated 
with the second index; and

(6) outputting a second indication that the second unit of information was found, 
if the second unit of information is present within the second cache device; 

(d) a coordination controller, coupled to the cache controller, for receiving from 
the cache controller the first and second indications that the first and second 
cache devices have found the first and second units of information, and in response 
to such receipt of such indications, enabling outputting of the first and second 
units of information; and 

(e) a lock means for locking the first and second cache devices to prevent a 
particular unit of information from being altered when that unit of information is 
in use; 

wherein the first unit of information is an exchange context and the second unit of 
information is a port context, the exchange context and the port context being 
associated with a frame of data which is either being transmitted or received, and 
wherein the lock means activates a transmit lock bit when a port context or 
exchange context is being used in association with a frame to be transmitted. 

7. A communications adapter within a host, for receiving and transmitting frames of 
data, including: 

(a) a memory device for storing context data including exchange context information 
and port context information; 

(b) a first cache device having shorter read times than the memory device, for 
storing a subset of the exchange context information; 

(c) a second cache device having shorter read times than the memory device, for 
storing a subset of the port context information; 

(d) a cache controller, coupled to the first and second cache device for: 
(1) receiving from an external device a first index; 

(2) searching the first cache device for an exchange context associated with the
first index;

(3) outputting a first indication that the exchange context was found, if the 
exchange context being associated with the first index is presently stored within 
the first cache device; 

(4) receiving from the first cache device a second index embedded within the 
exchange context; 

(5) searching the second cache device for a port context associated with the second 
index; and

(6) outputting a second indication that the port context associated with the second 
index was found, if the port context is present within the second cache device; 

(e) a coordination controller, coupled to the cache controller, for receiving from 
the cache controller the first and second indications that the first and second 
cache devices have found the exchange context and port context associated with the 
first and second index, and enabling outputting of the exchange context and port
context in response to receiving both the first and second indications; and 

(f) a microcontroller, coupled to the memory device and to the coordination 
controller, for generating exchange context and port context information to be 
stored within the memory device upon receipt of a request from the coordination 
controller, and for directly writing to the first and second cache device. 

11. A method for storing and retrieving units of information in a linked cache 
device having a first memory device and a second memory device, and a cache 
controller, the first and second memory devices each having units of information 
stored within that are a first and second subset, respectively, of units of 
information stored in a related external memory device, including the steps of: 

(a) receiving from an external control device a first index; 

(b) searching the first memory device for a first unit of information associated 
with the first index;

(c) communicating a first indication that the first unit of information was found, 
if the first unit of information is presently stored within the first cache device; 



(d) receiving from the first memory device a second index embedded within the first 
unit of information; 

(e) searching the second cache device for a second unit of information associated
with the second index;

(f) outputting a second indication that the second unit of information was found, 
if the second unit of information is present within the second cache device; 

(g) if the first unit of information associated with the first index is not present 
in the first cache device, then: 

(1) performing a direct memory access operation into the related external memory 
device to read the first unit of information associated with the first index and 
store that first unit of information in the first cache device; 

(2) communicating a first indication that the first unit of information associated 
with the first cache device is presently within the first cache device; 

(h) if the second unit of information associated with the second index is not 
present in the second cache device, then: 

(1) performing a direct memory access operation into the related external memory 
device to read the second unit of information associated with the second index and 
store that second unit of information in the second cache device; and 

(2) communicating a second indication that the second unit of information 
associated with the second cache device is presently within the second cache 
device. 

12. A method for storing and retrieving units of information in a linked cache 
device having a first memory device and a second memory device, and a cache 
controller, the first and second memory devices each having units of information 
stored within that are a first and second subset, respectively, of units of 
information stored in a related external memory device, including the steps of: 

(a) receiving from an external control device a first index; 
(b) searching the first memory device for a first unit of information associated
with the first index;

(c) communicating a first indication that the first unit of information was found, 
if the first unit of information is presently stored within the first cache device; 



(d) receiving from the first memory device a second index embedded within the first 
unit of information; 

(e) searching the second cache device for a second unit of information associated 
with the second index;

(f) outputting a second indication that the second unit of information was found, 
if the second unit of information is present within the second cache device; 

(g) receiving from the cache controller the first and second indications that the 
first and second cache devices have found the first and second unit of information; 

(h) enabling outputting of the first and second unit of information in response to 
receiving both the first and second indications; and 

(i) performing a direct memory access operation to read information from the 
external memory device. 

13. A method for storing and retrieving units of information in a linked cache 
device having a first memory device and a second memory device, and a cache 
controller, the first and second memory devices each having units of information 
stored within that are a first and second subset, respectively, of units of 
information stored in a related external memory device, including the steps of: 

(a) receiving from an external control device a first index; 

(b) searching the first memory device for a first unit of information associated
with the first index, the first unit of information being an exchange context; 

(c) communicating a first indication that the first unit of information was found, 
if the first unit of information is presently stored within the first cache device; 



(d) receiving from the first memory device a second index embedded within the first 
unit of information; 

(e) searching the second cache device for a second unit of information associated 
with the second index, the second unit of information being a port context; 

(f) outputting a second indication that the second unit of information was found, 
if the second unit of information is present within the second cache device; 

(g) receiving from the cache controller the first and second indications that the 
first and second cache devices have found the first and second unit of information; 



(h) enabling outputting of the first and second unit of information in response to 
receiving both the first and second indications; and 

(i) locking the first and second memory devices to prevent a particular unit of 
information from being altered when that unit of information is in use; 
(j) associating the exchange context and the port context with a frame of data
which is either being transmitted or received; and 

(k) activating a receive lock bit when the port context or the exchange context is 
being used in association with a received frame. 

14. A method for storing and retrieving units of information in a linked cache 
device having a first memory device and a second memory device, and a cache 
controller, the first and second memory devices each having units of information 
stored within that are a first and second subset, respectively, of units of 
information stored in a related external memory device, including the steps of: 

(a) receiving from an external control device a first index; 

(b) searching the first memory device for a first unit of information associated 
with the first index, the first unit of information being an exchange context; 

(c) communicating a first indication that the first unit of information was found, 
if the first unit of information is presently stored within the first cache device; 



(d) receiving from the first memory device a second index embedded within the first 
unit of information; 

(e) searching the second cache device for a second unit of information associated
with the second index, the second unit of information being a port context; 

(f) outputting a second indication that the second unit of information was found, 
if the second unit of information is present within the second cache device; 

(g) receiving from the cache controller the first and second indications that the 
first and second cache devices have found the first and second unit of information; 



(h) enabling outputting of the first and second unit of information in response to 
receiving both the first and second indications; and 

(i) locking the first and second memory devices to prevent a particular unit of 
information from being altered when that unit of information is in use; 

(j) associating the exchange context and the port context with a frame of data 
which is either being transmitted or received; and 

(k) activating a transmit lock bit when the port context or the exchange context is 
being used in association with a frame to be transmitted. 
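
The receive and transmit lock bits recited in claims 5, 6, 13, and 14 can be modelled with a small bit field. The sketch below is an assumption-laden illustration: the constants RECEIVE_LOCK and TRANSMIT_LOCK, the class CacheEntry, and the choice to raise an exception on a blocked write are all introduced here and are not taken from the patent.

```python
RECEIVE_LOCK = 0b01   # hypothetical bit: context in use for a received frame
TRANSMIT_LOCK = 0b10  # hypothetical bit: context in use for a frame to be transmitted


class LockedEntryError(Exception):
    """Raised when a write targets an entry that is currently in use."""


class CacheEntry:
    def __init__(self, data):
        self.data = data
        self.locks = 0  # no lock bits set

    def lock(self, bit):
        self.locks |= bit

    def unlock(self, bit):
        self.locks &= ~bit

    def write(self, data):
        # An entry may not be altered while either lock bit is active.
        if self.locks:
            raise LockedEntryError("entry is in use")
        self.data = data
```

Keeping separate bits lets an entry remain protected while it is in use for both a received frame and a frame to be transmitted; the write path is released only when both bits are clear.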
L9: Entry 1 of 1 File: USPT Nov 11, 2003



DOCUMENT-IDENTIFIER: US 6647301 B1

TITLE: Process control system with integrated safety control system 
Detailed Description Text (23) : 

A Compiling Translator is computer executed software which converts and translates 
high-level source code language software (such as the Application Program) into 
Machine Operation Code by using techniques of software compiling, lexical analysis, 
comment removal, macro substitution, statement analysis with rules, intermediate 
object code generation, assembly code generation, linkage to standard subroutines 
in Machine Operation Code form, and/or binary machine code generation. 

Detailed Description Text (31) : 

In introduction of the control-computer-executed logic and the use of the 
aforementioned SEQUENCE object in a control computer program in the preferred 
embodiment, a brief review of two terms related to the art and technology of 
chemical engineering is in order. Traditional concepts of the "unit operation" (a 
particular kind of physical change procedure used and effected in chemical 
processing; for example, filtration, evaporation, distillation, or heat transfer)
and the "unit process" (a particular kind of transformation in materials via
chemical reaction; for example, oxidation, hydrolysis, esterification,
polymerization, or nitration) define functional activities which must execute in 
order to effect a particular desired incremental transformation of a "starting" 
material to a "product" material. These physical changes and chemical 
transformations usually occur through use of an apparatus which, in operation, 
converts starting material into product material (in both a micro and macro 
context); and it is useful for that portion of the apparatus involved in the
execution of at least one "unit operation" or "unit process" to be referenced and
organized for control purposes in the described embodiment in at least one "Process
Unit" (for example, a reactor, a fractionating tower, a crystallizer, a dryer, a
tank farm, a distillation unit, an extraction unit, or an evaporator). In an
overall chemical manufacturing plant a number of sequential and parallel "unit 
operations" and "unit processes" are therefore effectively executed collectively in 
a respective set of "Process Units" to convert many instances of "starting 
materials" into "product materials" (where the "product material" from one "Process 
Unit" is very often the "starting material" for the next "Process Unit" when the 
situation is examined in a micro context). Engineering and operations personnel
have traditionally referenced and managed the operation of such a plant apparatus 
in the context of a set of such "Process Units" in effecting operation of these 
sequential and parallel "unit operations" and "unit processes" where
"raw" ("starting") material(s) are converted into "finished" ("product")
material(s) in a macro context. In example, a team of operating technicians might be
"starting up" "reactor" "Process Units" in a plant even as the team is also "doing 
maintenance" on a first "distillation tower" "Process Unit" and "running" a second 
"distillation tower" "Process Unit" disposed to operate in parallel with the first 
distillation tower. But it needs to also be appreciated that the "Process Units" do 
not stand free in a physically separated and defined manner in an apparatus such as 
a chemical manufacturing plant (the target use of the preferred embodiment) because 
the chemical manufacturing plant usually enables internal fluid movement by 
providing essentially one large apparatus built of conjoined vessels, pipes, pumps,
valves, instruments, and wires. The "Process Unit" definitions are therefore, in 
verity, logical (in both cerebral and electrical-circuit-implemented contexts) and 
cultural rather than specifically and clearly physically defined. In example of 
this point, when a pipe with a valve connects a reactor to a surge tank, the 
reactor, pipe, valve, and surge tank are constructed as one unified and conjoined 
physical apparatus; one can conveniently consider the reactor as a first "Process 
Unit" and the surge tank as a second "Process Unit", but the issue of which Process 
Unit includes the valve (and the pipe) for management and control purposes requires 
acceptance of the fact that a technologically artistic and cultural decision must 
be made since (a) there is no clear and specific physical separation between the 
two "Process Units" within the valve and (b) bifurcated control of the valve at a 
particular moment of real-time is definitely not desirable. 

Detailed Description Text (52) : 

Turning now to consideration of FIG. 3A, a block diagram of a preferred embodiment 
of control computer 82 is shown as control computer 10. Control computer 10 
includes a Central Processor and Control Unit ("CPU") 12. In accordance with one 
embodiment herein, CPU 12 is based upon the MIPROC processor from Radstone 
Technology PLC, a Harvard architecture processor. However, it should be appreciated 
that other CPU circuits or microprocessors are conditionally used, and that the 
principles of the present invention are not limited to any particular CPU 
construction or integration. It should also be appreciated that all of the circuits 
in computer 10 are integrated into a single microcomputer chip in the appropriate 
application as might be achieved in a contemplated embodiment via use of ASIC 
(Application Specific Integrated Circuitry) technology (in conjunction with 
contemplated appropriate approval by the certifying agency as an acceptable 
certified embodiment).
File: USPT



Sep 17, 2002 



DOCUMENT-IDENTIFIER: US 6453356 B1
TITLE: Data exchange system and method 



Brief Summary Text (15): 

Process monitoring, tracing, and logging are provided to track the progress of data
passing through the data exchange engine and to detect and correct processing
errors. In the case of a processing anomaly, the data exchange engine effects a 
rollback of failed transactions to preclude the loss of data. Performance 
statistics may also be provided. 

Detailed Description Text (39) : 

Also shown in FIG. 9 is a statistics monitor module 264 and an associated 
statistics log 276 which are used to provide monitoring and tracking of data as it 
moves through the data exchange system. The statistics monitor module 264 also
provides historical performance information on queues and historical information on
system resource usage. As will be described in greater detail hereinbelow, the
statistics monitor module 264 provides a means for logging and tracing a given
application. Logging reveals the state of the application at the time of an error, 
while tracing provides a description of all software events as they occur. The 
tracing information may be used for tracking the application, state, and other 
related operations. The tracing information may be used in conjunction with the 
logging information to determine the cause for an error since it provides 
information about the sequence of events prior to an error. 
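
The distinction between logging (state at the time of an error) and tracing (the running sequence of events) can be sketched as below. This is a minimal illustration, not the statistics monitor module itself; the class TraceLog, its methods, and the bounded event buffer are assumptions made for the example.

```python
from collections import deque


class TraceLog:
    def __init__(self, max_events=100):
        # Tracing: a bounded record of software events as they occur.
        self.events = deque(maxlen=max_events)
        # Logging: snapshots of state captured when an error happens.
        self.errors = []

    def trace(self, event):
        self.events.append(event)

    def log_error(self, message, state):
        # Pair the state at the time of the error with the events that
        # preceded it, so the cause can be reconstructed.
        record = {
            "message": message,
            "state": dict(state),
            "preceding_events": list(self.events),
        }
        self.errors.append(record)
        return record
```

Reading the two together gives both what the component looked like when it failed and how it got there.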

Detailed Description Text (118) : 

The logging and tracing utility provides for logging and tracing during execution 
of a component. As discussed previously, logging reveals the state of the component 
at the time of an error, while tracing provides a description of all software 
events as they occur. The tracing information may be used for tracking the 
component, state, and other related operations. It may be used in conjunction with
the logging information to determine the cause of an error, as it provides 
information about the sequence of events prior to an error. 

CLAIMS : 

7. The method of claim 1, further comprising tracking each of the data streams 
during converting, identifying, or transforming operations. 

17. The method of claim 10, further comprising tracking the data during converting, 
identifying, or transforming operations. 

25. The method of claim 19, further comprising: tracking processing of the 
information having the generic format; and logging errors occurring during the 
processing of the information having the generic format. 

49. The medium of claim 44, further comprising tracking the data during converting, 
identifying, or transforming operations. 
56. The system of claim 51, further comprising means for tracking the data during
converting, identifying, or transforming operations. 
L4: Entry 1 of 1



File: USPT 



Sep 17, 2002 



DOCUMENT-IDENTIFIER: US 6453356 B1
TITLE: Data exchange system and method 



Detailed Description Text (51) : 

The Common Base Class is an abstract base from which the Common Object Class is 
derived. An inheritance tree graphically depicting the Common Base Class is shown 
in FIG. 17. The main purpose of this class is to provide a single object naming and 
typing mechanism to aid in object tree searches and traversals. The Common Base 
Class is characterized in the following code-level example. 

Detailed Description Text (61) : 

The following Common Object retrieval methods are used internally by the 
GetAttributeValue() and SetAttributeValue() methods to search the attribute list
of a Common Object and to locate a specific Common Attribute instance. These
retrieval methods may be used by application developers as well. Each of these
methods requires a fully dot (.) delimited Distinguished Name and will recursively
walk all relative levels of nesting to retrieve the relevant object/attribute.
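
A recursive walk over a dot-delimited Distinguished Name might look like the following Python sketch. The patent's own retrieval methods are not reproduced in this excerpt, so the classes CommonObject and CommonAttribute, the children dictionary, and the function find_by_distinguished_name are hypothetical stand-ins; the name is assumed to be relative to the object the search starts from.

```python
class CommonAttribute:
    def __init__(self, name, value):
        self.name = name
        self.value = value


class CommonObject:
    def __init__(self, name):
        self.name = name
        self.children = {}  # name -> nested CommonObject or CommonAttribute

    def add(self, node):
        self.children[node.name] = node
        return node


def find_by_distinguished_name(root, dn):
    """Resolve a dot-delimited name one nesting level per recursive call."""
    head, _, rest = dn.partition(".")
    child = root.children.get(head)
    if child is None or not rest:
        return child  # not found, or final name component reached
    if not isinstance(child, CommonObject):
        return None   # an attribute cannot contain further levels
    return find_by_distinguished_name(child, rest)
```

For example, with an object "customer" nested under a root and an attribute "zip" inside it, the name "customer.zip" resolves to that attribute in two steps.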

Detailed Description Text (221) :

Dequeue() attempts to find a queue object marked as NORMAL_OBJECT first from the
memory buffer. If it cannot find one, it will try to find one from the queue
files. If Dequeue() finds a queue object marked as NORMAL_OBJECT in a file, it
creates a queue object, marks it as ACTIVE_OBJECT, inserts it into the memory
buffer, and de-serializes the Common Object it refers to and returns the Common
Object to the caller. During this process, if Dequeue() cannot de-serialize the
Common Object, it will mark the queue object as DELETED_OBJECT in the queue file
and continue its search.
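
The search order described for Dequeue() can be sketched in Python as follows. Several details are assumptions made only for this illustration: payloads are JSON strings, the queue file is a list of records, a NORMAL_OBJECT found in the memory buffer is marked ACTIVE_OBJECT before being returned, and the file record is also marked ACTIVE_OBJECT so it is not handed out twice. The sketch also de-serializes before inserting into the memory buffer, which simplifies the failure path.

```python
import json

NORMAL_OBJECT = "NORMAL_OBJECT"
ACTIVE_OBJECT = "ACTIVE_OBJECT"
DELETED_OBJECT = "DELETED_OBJECT"


class QueueObject:
    def __init__(self, payload, status=NORMAL_OBJECT):
        self.payload = payload  # serialized Common Object (JSON here, an assumption)
        self.status = status


def dequeue(memory_buffer, queue_file):
    # 1. Look in the memory buffer first.
    for entry in memory_buffer:
        if entry.status == NORMAL_OBJECT:
            entry.status = ACTIVE_OBJECT  # assumption, see lead-in
            return json.loads(entry.payload)
    # 2. Fall back to the queue file.
    for record in queue_file:
        if record.status != NORMAL_OBJECT:
            continue
        try:
            common_object = json.loads(record.payload)
        except json.JSONDecodeError:
            # Cannot de-serialize: mark deleted in the file and keep searching.
            record.status = DELETED_OBJECT
            continue
        record.status = ACTIVE_OBJECT  # assumption, see lead-in
        memory_buffer.append(QueueObject(record.payload, ACTIVE_OBJECT))
        return common_object
    return None  # nothing available
```

A corrupt record is skipped once and never retried, because its DELETED_OBJECT mark excludes it from later searches.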
L7: Entry 1 of 1



File: USPT 



Sep 17, 2002 



DOCUMENT-IDENTIFIER: US 6453356 B1
TITLE: Data exchange system and method 



Detailed Description Text (113) : 

Two macros, DX_SYSINIT and DX_SYSEXIT, are used to manage initialization and 
destruction of the DX_SysConfigObject, respectively. A usage example of these two
macros is given as follows: 

Detailed Description Text (125): 

In order to write a message into the log/trace file, the developer may use the 
macro DX_TL as shown below: DX_TL(DX_ARGS, Category, StringToLog/ErrorNumber [, arg1
[, arg2]]);

Detailed Description Text (126) : 

The macro DX_ARGS includes parameters such as filename, line number, time and 
thread ID that are automatically written into the trace/log messages. Category is 
specified by the following enumerated data types: 
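
The enumerated category types are not reproduced in this excerpt. As a rough sketch of how a macro in the style of DX_ARGS can bundle call-site data automatically, consider the following; SK_ARGS, SK_TL, SketchCategory, and FormatLogLine are hypothetical names, and the line is returned as a string instead of being written to a log file:

```cpp
#include <ctime>
#include <sstream>
#include <string>
#include <thread>

// Hypothetical categories; the patent's own enumeration is not shown here.
enum SketchCategory { SK_INFO, SK_ERROR };

// Expands to the call-site data that the caller should not have to type:
// file name, line number, current time, and the calling thread's ID.
#define SK_ARGS __FILE__, __LINE__, std::time(nullptr), std::this_thread::get_id()

// Builds one log line from the call-site data, the category, and the message.
inline std::string FormatLogLine(const char* file, int line, std::time_t when,
                                 std::thread::id tid, SketchCategory category,
                                 const std::string& text) {
    std::ostringstream out;
    out << file << ':' << line << " t=" << when << " thread=" << tid
        << (category == SK_ERROR ? " [ERROR] " : " [INFO] ") << text;
    return out.str();
}

// The first macro argument is expected to be SK_ARGS, mirroring DX_TL(DX_ARGS, ...).
#define SK_TL(args, category, text) FormatLogLine(args, category, text)
```

A call such as SK_TL(SK_ARGS, SK_ERROR, "database space exhausted") then records where and on which thread the message originated without any extra effort by the developer.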

Detailed Description Text (134): 

An error/event may occur at a very low level in the code (e.g., database space 
exhausted). It is important to report this low level event, but it is also 
important to report the context, i.e., what the application was trying to achieve 
when this low level error occurred. The application developer is 
provided with macros to define a context within the developer's code. The set of 
macros provided for this purpose includes: INIT_CONTEXT; CONTEXT_BEGIN; and 
CONTEXT_END. In general, every function using the context macros should first use 
the macro INIT_CONTEXT. It is noted that, if INIT_CONTEXT is not called before 
defining CONTEXT_BEGIN, the code may not compile. 

Detailed Description Text (135) : 

The beginning of a context may be defined using the macro CONTEXT_BEGIN, and the 
end of a context can be defined using the macro CONTEXT_END, as is indicated in the 
following example. The CONTEXT_BEGIN macro takes the argument Context Number. This 
context number is used to access the Context Catalog of an application and to 
retrieve the context string. It is noted that nested contexts are generally not 
allowed. If a CONTEXT_BEGIN is called before the previous context is ended, an 
implicit CONTEXT_END for the previous context is assumed. The following example is 
provided: 

Detailed Description Text (140) : 

Within a given function, INIT_CONTEXT declares a pointer to a DX_ContextObject, 
referred to as dx_context, and initializes it to point to a global dummy 
DX_ContextObject, whose context string is blank. It also declares and initializes a 
variable dx_init_context. The definition of the INIT_CONTEXT macro is as follows: 

Detailed Description Text (142) : 

The macro CONTEXT_BEGIN, described in the following example, checks whether 
dx_init_context is initialized or not. The significance of this check is to make 
sure that the function does not compile if INIT_CONTEXT is not called before the 
first occurrence of CONTEXT_BEGIN. It then initializes the DX_ContextObject pointer 
to point to a new DX_ContextObject instance storing the context string specified by 
the context number argument. 

Detailed Description Text (144) : 

The macro CONTEXT_END deletes the DX_ContextObject instance created by 
CONTEXT_BEGIN, as can be seen in the following example. 
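
The macro definitions themselves are not reproduced in this excerpt. The following is a minimal sketch of how three such macros could cooperate, assuming a small in-memory context catalog; all SK_ and Sketch names are hypothetical and are not the patent's definitions:

```cpp
#include <map>
#include <string>

struct SketchContext { std::string text; };

// Global dummy context whose string is blank.
inline SketchContext& DummyContext() { static SketchContext dummy; return dummy; }

// Hypothetical context catalog: context number -> context string.
inline const std::map<int, std::string>& ContextCatalog() {
    static const std::map<int, std::string> catalog = {
        {1, "loading customer record"}, {2, "writing audit entry"}};
    return catalog;
}

// Declares the context pointer (initially the dummy) and the init flag.
#define SK_INIT_CONTEXT \
    SketchContext* sk_context = &DummyContext(); \
    bool sk_init_context = true;

// Deletes the current context, if any, and falls back to the dummy.
#define SK_CONTEXT_END \
    if (sk_context != &DummyContext()) { delete sk_context; sk_context = &DummyContext(); }

// Mentions sk_init_context, so the code does not compile unless SK_INIT_CONTEXT
// was used first. An open context is ended implicitly before the new one begins.
#define SK_CONTEXT_BEGIN(num) \
    (void)sk_init_context; \
    SK_CONTEXT_END \
    sk_context = new SketchContext{ContextCatalog().at(num)};

// Usage: report what was being attempted, then show the blank dummy after the end.
std::string DescribeFailure() {
    SK_INIT_CONTEXT
    SK_CONTEXT_BEGIN(1)
    std::string seen = sk_context->text;   // context in effect when an error occurs
    SK_CONTEXT_END
    return seen + "|" + sk_context->text;  // dummy context is current again
}
```

Note that in this sketch a function that returns without reaching SK_CONTEXT_END leaks the context object; a production version would need to guard against that.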

Detailed Description Text (169) : 

As in the case of run-time configuration management, the ReconfigParameters ( ) 
function on DX_SysConfigObject will be called. In this function, the 
DX_SysConfigObject first checks if the signal/event received corresponds to 
Shutdown and if the PID specified is its own PID. If so, it, in turn, must make 
sure that no new transactions are started, and waits for all of the current 
transactions to be completed. This involves calling the macro DX_SYSEXIT. It is 
noted that, before shutting down, the entry in the configuration file should be 
deleted by the exiting process. It is possible that the component aborts prior to 
cleaning up the configuration file. This stray entry does not affect the start up 
of any other component using the same configuration file. DX_ConfigSet is also 
responsible for clean up of stray DX_SHUTDOWN entries in the configuration file. 

Detailed Description Text (181): 

A macro called DX_Thread_Execute ( ) is provided for ease of use. This macro 
retrieves the DX_ThreadController instance from the DX_SysConfigObject and then 
invokes the DX_ThreadController::Execute ( ) method. The method 
DX_ThreadController::Execute ( ) behaves exactly the same as if a call was invoked 
to create a new thread. A pointer to the function must be passed, as well as a 
pointer to the arguments. Internally, when a thread is not available, the 
DX_ThreadController uses the class DX_ThreadRequest to provide a FIFO buffer that 
stores the function pointer and argument pointer. Each time a thread completes 
execution, the FIFO is checked for the presence of entries. If there are entries in 
the FIFO, the first entry in the buffer is removed and executed. An example of 
DX_ThreadController implementation is provided in the following example: 
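
The original listing is not reproduced in this excerpt. The FIFO behavior described above can be sketched with standard C++ threads as follows; SketchThreadController is a hypothetical class, a std::pair stands in for DX_ThreadRequest, and the fixed thread limit is an assumption:

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

class SketchThreadController {
public:
    using Fn = void (*)(void*);

    explicit SketchThreadController(std::size_t maxThreads)
        : maxThreads_(maxThreads == 0 ? 1 : maxThreads) {}

    ~SketchThreadController() { WaitAll(); }

    // Runs fn(arg) on a new thread if one is available, otherwise queues the request.
    void Execute(Fn fn, void* arg) {
        std::lock_guard<std::mutex> lock(mu_);
        if (running_ < maxThreads_) {
            ++running_;
            threads_.emplace_back(&SketchThreadController::Run, this, fn, arg);
        } else {
            pending_.emplace_back(fn, arg);  // FIFO of (function, argument) requests
        }
    }

    // Blocks until every started and queued request has finished.
    void WaitAll() {
        std::vector<std::thread> done;
        {
            std::unique_lock<std::mutex> lock(mu_);
            idle_.wait(lock, [this] { return running_ == 0; });
            done.swap(threads_);
        }
        for (std::thread& t : done) t.join();
    }

private:
    // Each time a request completes, the FIFO is checked; the first entry, if any,
    // is removed and executed on the same thread.
    void Run(Fn fn, void* arg) {
        for (;;) {
            fn(arg);
            std::lock_guard<std::mutex> lock(mu_);
            if (pending_.empty()) {
                --running_;
                idle_.notify_all();
                return;
            }
            std::pair<Fn, void*> next = pending_.front();
            pending_.pop_front();
            fn = next.first;
            arg = next.second;
        }
    }

    const std::size_t maxThreads_;
    std::size_t running_ = 0;
    std::mutex mu_;
    std::condition_variable idle_;
    std::deque<std::pair<Fn, void*>> pending_;
    std::vector<std::thread> threads_;
};
```

From the caller's point of view, Execute takes the same two pointers that a thread creation call would take, which matches the "behaves exactly the same" description; the queueing is invisible to the caller.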

Drawing Description Text (14) : 

FIG. 17 is an illustration of an inheritance tree graphically depicting a Common 
Base Class associated with the Common Object shown in FIG. 15; 

Drawing Description Text (15) : 

FIG. 18 is a class structure diagram showing public and non-public interfaces 
associated with various file based and database based queuing processes; 

Drawing Description Text (16) : 

FIG. 19 is a pictorial description of the calling structure between data exchange 
queue classes when dequeueing a Common Object in accordance with one embodiment of 
the present invention; and 

Drawing Description Text (17): 

FIG. 20 is a pictorial description of the calling structure between data exchange 
queue classes when enqueueing a Common Object in accordance with the embodiment of 
FIG. 19. 

Detailed Description Text (46) : 

A Common Object within the context of this illustrative embodiment represents a C++ 
container object that is used to contain multiple portions of attribute data within 
a single flexible object. A Common Object comprises two lists, a Private List and a 
Public List. Both lists contain one or more attribute/value pairs (AV pairs) . These 
AV pairs represent objects referred to as Common Attribute Objects, each of which 
comprises an attribute name, value, and type. Common Attribute classes are 
available for all of the basic types plus some complex types as well. 

Detailed Description Text (47) : 

The Public List represents a sequence of two types of user-defined attributes, 
which are instances of an Attribute class or a Common Object. The Private List 
contains attributes that are used internally by the system for a variety of 
purposes, such as naming, routing, and identifying ancestry information. The Public 
List contains data that the user has defined, which may include other Common 
Objects. Each list may contain two types of attributes, type specific AV pairs or 
objects. Contained objects may either be List Objects or other Common Objects. The 
Private List does not include contained objects since this List is only used for 
simple tags or header information. 

Detailed Description Text (51) : 

The Common Base Class is an abstract base from which the Common Object Class is 
derived. An inheritance tree graphically depicting the Common Base Class is shown 
in FIG. 17. The main purpose of this class is to provide a single object naming and 
typing mechanism to aid in object tree searches and traversals. The Common Base 
Class is characterized in the following code-level example. 
Detailed Description Text (79) : 

The use of reference counting greatly reduces the amount of time and memory that is 
required to copy objects. Use of a third-party foundation class library, such as 
one developed by Rogue Wave Software, Inc., automatically supplies a number of the 
copy constructors. Also, methods within the DX_CommonObject class itself make use 
of the Rogue Wave copy methods as well. It is noted that the DX_StringAttribute and 
DX_BlobAttribute classes provide their own copy optimization through reference 
counting, as objects of these classes could be of a substantial size. 

Detailed Description Text (80) : 

The Common Attribute is an object that is contained within a Common Object and is 
used to contain attribute data. The Common Attribute contains a private attribute 
that denotes the specific attribute type and a set of public attributes for name 
and value. A Common Attribute may be a simple attribute of a specific data type, 
name, and value or it may contain another object, such as a List Object or Common 
Object. The type specific Common Attribute classes all inherit their capabilities 
from the generic Common Attribute class so all classes will behave in an equivalent 
manner. Reference is made to FIGS. 16A-16D, which illustrate the contents of a 
Common Attribute when used to represent some of the supported data types. 

Detailed Description Text (81) : 

The Common Attribute class is an abstract base class from which type specific 
Attribute classes are derived, a characterizing example of which is given as 
follows : 

Detailed Description Text (88) : 

Code-level examples of various type specific attribute classes that are supported 
are provided as follows: 

Detailed Description Text (91) : 

EXAMPLE #20 class : DX_String 

The DX_String class is a reference counted container class used by the 
DX_StringAttribute to store the attribute value. It gives the user a way to keep 
down the overhead associated with having to copy the data. 
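
A reference counted string container of this kind can be sketched as follows. SketchString is a hypothetical illustration, not the DX_String implementation: copies share one character buffer and a shared counter, and the buffer is freed when the last owner goes away. The sketch is not thread safe.

```cpp
#include <cstring>

class SketchString {
public:
    explicit SketchString(const char* value)
        : data_(new char[std::strlen(value) + 1]), refCount_(new int(1)) {
        std::strcpy(data_, value);
    }

    // Copying shares the buffer and bumps the counter; no characters are copied.
    SketchString(const SketchString& other)
        : data_(other.data_), refCount_(other.refCount_) {
        ++*refCount_;
    }

    SketchString& operator=(const SketchString& rhs) {
        if (this != &rhs) {
            ++*rhs.refCount_;  // take the new reference before dropping the old one
            Release();
            data_ = rhs.data_;
            refCount_ = rhs.refCount_;
        }
        return *this;
    }

    ~SketchString() { Release(); }

    const char* c_str() const { return data_; }
    int UseCount() const { return *refCount_; }

private:
    // Drops one reference and frees the shared state when it was the last one.
    void Release() {
        if (--*refCount_ == 0) {
            delete[] data_;
            delete refCount_;
        }
    }

    char* data_;
    int* refCount_;
};
```

The cost of a copy is therefore independent of the string length, which is the overhead reduction the paragraph refers to.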

Detailed Description Text (103): 

As was discussed previously, a List is a sequence of attributes, i.e., instances of 
attribute classes and/or Common Objects, an example of which is given as follows: 

Detailed Description Text (112) : 

When initiated, a component creates an instance of a System Configuration Object 
(DX_SysConfigObject) that stores the current parameter settings. The component also 
registers for a Signal/Event so that it is informed of changes to the configuration 
using the dynamic configuration command line interface/web interface. When a user 
wants to change the run time parameters of a component (identified by the process 
ID and the machine on which it is running), a signal/event is sent to the component 
to update its configuration. A signal/event handler invokes the ReconfigParameters 
( ) method on the DX_SysConfigObject, which takes care of reconfiguring the various 
controller objects, such as DX_TraceLogObject, DX_QueueManager, and 
DX_ThreadController for example. The System Configuration object, 
DX_SysConfigObject, is a singleton object that initializes and controls the 
configuration of a component in the data exchange system, such as logging/tracing 
levels, thread controller, queuing, and performance monitoring. A singleton object, 
as understood in the art, is C++ nomenclature meaning that only a single 
instance of the class may exist within a single executable. A singleton object is 
most commonly used when controlling system wide resources. 

Detailed Description Text (114) : 

EXAMPLE #34 

#define DX_SYSINIT(ComponentName) \
    DX_SysConfigObject::Instance(ComponentName); \
    RegisterForEvent();

#define DX_SYSEXIT \
    EndWaitForEvent(); \
    DX_SysConfigObject::DeleteInstance();

class DX_SysConfigObject
{
public:
    // To ensure that only one instance of the System Config object
    // exists, one has to always use this function to obtain a
    // reference to the system config object.
    // Cannot delete the pointer returned, use DX_SYSEXIT if you want
    // to delete the sysconfigobject
    static DX_SysConfigObject* Instance(char *componentName=0);

    // static method to delete the instance of the singleton object
    static void DeleteInstance();

    // Called when the parameters are to be changed at run time
    void ReconfigParameters();

    // used to find parameter values by name from the paramValueList
    // do not delete the pointer returned by this function
    char *FindValue(char *name);

private:
    // constructor is private to make sure the user cannot instantiate
    DX_SysConfigObject(char *componentName);

    // destructor
    ~DX_SysConfigObject();

    // to read the configuration file and initialize the paramValueList
    EreturnCodes InitParamValueList(char *dx_home, char** outCfgFileName);

    void GetTraceLogParams(DX_INDICATOR *trcLogCategoryInd,
                           char* trcLogDir, long &logSize);

    // a pointer that stores the one and only instance of the
    // system config object
    static DX_SysConfigObject* instance;

    // the list of various configuration parameters
    //DX_ListObject *paramValueList;
    RWDlistCollectables *paramValueList;

    char ComponentName[MAX_NAME_LEN];

    // Pointer to DX_ThreadController instance
    DX_ThreadController *PthrCtrl;

    // Method to instantiate DX_ThreadController
    // after DX_SysConfigObject constructor
    EreturnCodes InitThreadController();
    void DeleteThreadController();
    void DeletePMonitor();

    // Pointer to DX_Monitor instance
    DX_Monitor *pMonitor;

    // Method to instantiate DX_Monitor after
    // DX_SysConfigObject constructor
    EreturnCodes InitMonitor(const char *appName);
    EreturnCodes InitTraceLogObject(char* componentName);

    // Get Monitor config parameters
    void GetMonitorParam(struct MonConfigType *monConfig);

    // To access the DX_ThreadController object
    DX_ThreadController* GetThreadController();

    DX_Mutex* paramListLock;
};

Detailed Description Text (157) : 

The Monitor Object is instantiated by the DX_SysConfigObject instance or by calling 
the static method DX_Monitor::Instance ( ) directly. There is only one DX_Monitor 
thread running per executable component. The monitor thread is spawned whenever the 
MONITOR in the system configuration file is triggered to ON. The monitor thread 
exists until the MONITOR is triggered to OFF. The implementation of the monitor 
impacts three areas. The Queue Manager provides the queue performance data. The 
DX_SysConfigObject provides the configuration change handling and the monitor 
object instantiation. The DX_Monitor Object spawns or kills the monitor and 
generates the monitor log files or log table in the database. The methods added in 
the DX_Queue classes are listed below: 

Detailed Description Text (161) : 

An additional method added in the DX_Queue classes is provided as follows: 

Detailed Description Text (175) : 

The queue viewing administration and maintenance utility will now be described. 
DX_Qview <queue_name> [priority] permits viewing of all items in a Queue identified 
by its name. This information may be obtained using the GetQueueView ( ) method on 
DX_QueueManager object. DX_GetCO <oidval> permits viewing of the common object for 
a particular OID in the queue. This utility uses the Demarshal ( ) method provided 
in DX_CommonObject class . Other viewing options include the following: viewing the 
queue entry corresponding to common object specified by its name or OID; viewing 
all queue entries enqueued by a particular source; viewing all queue entries having 
a particular status; and viewing the names of all objects in the queue. 

Detailed Description Text (183) : 

A DX_Utils library provides the DX_Mutex class for platform independent mutex 
protection, an example of which is provided below. The DX_Mutex class does not 
require use of the DX_ThreadController class . 

Detailed Description Text (185) : 

For purposes of data internationalization, Unicode UTF-8 formatting is provided to 
store all attribute value strings using wide character strings. Code conversion 
functions to convert a Unicode string to UTF-8 string and vice-versa are also 
provided. These conversion methods are used to store any user-specified data 
internally in UTF-8 format. To support language localization, all message strings 
use an external message catalog. The interface provided by the DX_CodeConversion 
class is as follows: 
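
The DX_CodeConversion interface is not reproduced in this excerpt. As an illustration of the kind of conversion involved, the following sketch converts between Unicode code points and UTF-8 bytes. ToUtf8 and FromUtf8 are hypothetical names, std::u32string is an assumed stand-in for the wide character string type, and the decoder is simplified (it does not reject overlong encodings):

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Encode Unicode code points as UTF-8 (1 to 4 bytes per code point).
std::string ToUtf8(const std::u32string& text) {
    std::string out;
    for (char32_t cp : text) {
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("not a Unicode scalar value");
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Decode UTF-8 bytes back into code points.
std::u32string FromUtf8(const std::string& bytes) {
    std::u32string out;
    for (std::size_t i = 0; i < bytes.size();) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;  // number of continuation bytes that follow
        char32_t cp = 0;
        if (lead < 0x80) { cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        if (i + extra >= bytes.size()) throw std::invalid_argument("truncated UTF-8");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) throw std::invalid_argument("invalid continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out += cp;
        i += extra + 1;
    }
    return out;
}
```

Storing strings internally as UTF-8 keeps plain ASCII data at one byte per character while still covering the full Unicode range.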



Detailed Description Text (197) : 

As was discussed previously, two types of priority based queues are used, namely, 
the incoming Receive Queues and the outgoing Send Queues. Each outgoing adapter 
will have its own outgoing queue so that any interface specific translation or 
routing may be performed outside the engine core. Each instance of the DX_Engine 
executable has one or more input queues, although only one is allowed for file- 
based queues, and one or more output queues. An instance of the DX_QueueManager 
class is used as a central proxy to all queue access, and will be mutex protected 
and record-lock protected, for file-based implementation, or row lock protected, 
for database implementations, to prevent data contention. 

Detailed Description Text (202) : 

The Queue Manager public interface makes use of the DX_QueueTransaction object for 
transaction control. The Enqueue ( ), Dequeue ( ), Commit ( ), and Rollback ( ) methods 
take, as a pass-in argument, an instance of the DX_QueueTransaction class which belongs 
to a running thread. The transaction object contains an ordered list of operations 
performed in this transaction. For file-based implementations, all operations are 
maintained in buffered memory and are not written into file storage until commit 
time. For database implementations, the database provided rollback mechanism is 
employed, with each transaction using its own unique run-time database context. 
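
For the file-based case, the buffering of operations until commit time can be sketched as follows. The Sketch classes are hypothetical, only Enqueue is modeled, and a vector stands in for the queue file:

```cpp
#include <string>
#include <vector>

// Hypothetical transaction object: an ordered list of buffered operations.
class SketchTransaction {
public:
    void Record(const std::string& item) { ops_.push_back(item); }
    const std::vector<std::string>& Operations() const { return ops_; }
    void Clear() { ops_.clear(); }

private:
    std::vector<std::string> ops_;
};

class SketchFileQueue {
public:
    // Enqueue only records the operation in memory; storage is untouched.
    void Enqueue(const std::string& item, SketchTransaction& txn) { txn.Record(item); }

    // Commit applies the buffered operations to storage in their original order.
    void Commit(SketchTransaction& txn) {
        for (const std::string& item : txn.Operations()) storage_.push_back(item);
        txn.Clear();
    }

    // Rollback simply discards the buffered operations.
    void Rollback(SketchTransaction& txn) { txn.Clear(); }

    const std::vector<std::string>& Storage() const { return storage_; }

private:
    std::vector<std::string> storage_;  // stands in for the queue file
};
```

Because nothing reaches storage before Commit, a rollback in this scheme needs no undo logic, which is presumably why the file-based implementation defers its writes.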

Detailed Description Text (203) : 

The class structure diagram is shown in FIG. 18. The public interface, shown in 
shaded boxes, is used by adapter developers, but the non-public interface, shown 
without shading, is intended for internal use only. This usage restriction is 
enforced by the implementation. 

Detailed Description Text (206) : 

The DX_QueueManager class is a singleton class and acts as the global access point 
of all queue operations. It contains a list of queues, instances of DX_Queue class, 
as its data member. Users, however, do not need to create a queue with Queue 
Manager before using it. The creation of the queue is embedded in Enqueue and 
Dequeue operations. Besides the Enqueue ( ) and Dequeue ( ) operations, 
DX_QueueManager also defines interfaces for extended transaction support, 
performance monitoring, and queue administration. An illustrative example of 
DX_QueueManager implementation is provided as follows: 

Detailed Description Text (210) : 

DX_Queue is an abstract interface class. It only provides an interface for 
Enqueue/Dequeue operations and Commit/Rollback operations. An implementation 
example is given as follows: 

Detailed Description Text (212) : 

The DX_FileQueue class contains four child queues, one for each priority. Besides the 
queue operation interface and transaction interface, the algorithm of priority handling 
is also implemented in this class. The priority algorithm is implemented inside the 
Dequeue method. The DX_IndexObject argument of the Dequeue methods is used for 
transaction control. At run time, Dequeue operations fill in corresponding 
fields in DX_IndexObject, which is a component of the DX_QueueOperation object. An 
implementation example is given as follows: 

Detailed Description Text (217) : 

A logical dequeue cursor is also defined and manipulated in this class. For 
purposes of performance, this cursor should never be rolled back. This feature is 
also implemented with the help of the DX_QueueObjectList class. 

Detailed Description Text (225) : 

The DX_IndexObject class is important for all the queue operations. It is used to 
uniquely identify the location at which a Common Object is stored, and is further 
used for internal routing of all queue requests. The DX_IndexObject class is a member 
of the DX_QueueObject class and DX_QueueOperation class. Its members include queue 
name, queue priority, storage type, file handle, file index, record offset, and 
record sequence. An illustrative example of a DX_IndexObject implementation is 
provided as follows: 

Detailed Description Text (232) : 

Objects are stored and retrieved from a persistent store in an efficient manner. 
Two types of object storage, termed relational database storage and file storage, 
are implemented for this purpose. The object persistency is implemented using 
Marshal ( ), Demarshal ( ), and DeleteStorage ( ) methods of the DX_CommonObject 
class, where a parameter is passed to indicate storage type. A policy for object 
caching and sharing may also be employed. 

Detailed Description Text (237) : 

It is noted that when serializing and de-serializing an object in a Java™ 
environment, the Rogue Wave Jtools Java™ foundation classes may be used since 
they provide a broad set of base classes and are compatible with the Rogue Wave 
Tools.h++ classes used in C++. 

Detailed Description Paragraph Table (1) : 

class DX_CommonBase : public RWCollectable
{
    RWDECLARE_COLLECTABLE(DX_CommonBase);

protected:
    char* Name;
    int Type;
    void SetName(const char* inName);

public:
    // The accessors return the protected members declared above
    char* GetName() { return Name; }
    int GetType() { return Type; }
    virtual void PrintContents() = 0;
    char* GetTypeName(int type);
};

Detailed Description Paragraph Table (2) : 

enum EclassTypes {
    DX_OBJECT = 0, DX_ATTRIBUTE, DX_LIST, DX_MULTIVALUE, DX_INTEGER,
    DX_STRING, DX_STRINGVAL, DX_DATE, DX_TIME, DX_REAL, DX_BLOB,
    DX_BLOBVALUE, DX_VERSION, DX_OID_CLASS
};

enum EreturnCodes {
    NOT_FOUND = 0, SUCCESS, FAILED, TIME_OUT, ACCESS_DENIED,
    DUPLICATE_OBJECT, NO_OBJECT, NO_ATTRIBUTE, INVALID_ATTRVAL,
    INVALID_ARGS, INVALID_OPERATION, SYSTEM_ERROR
};

enum EstorageTypes { FLAT_FILE = 0, DATABASE };

enum ElistType { PUBLIC = 0, PRIVATE };

static const int MAX_NAME_LEN=255;
static const int MAX_DOTTED_NAMELEN=8096;
static const int MAX_MULTI_VAL=255;
static const int MAX_BLOB_SIZE=2048;
static const int OID_LEN=128;
static const int MAX_FILE_NAME=256;
static const int MAX_PID_LEN=12;
static const int MAX_LINE_SIZE=100;

Detailed Description Paragraph Table (3) : 

class DX_CommonObject : public DX_CommonBase
{
    RWDECLARE_COLLECTABLE(DX_CommonObject);
    friend class DX_ListObject;
    friend class DX_FileSubQueue;

public:
    virtual ~DX_CommonObject();

    // Constructors
    // If name==NULL, the name is set to "NOT_SET".
    // If name is "" or contains a "." it will be set to "INVALID NAME"
    DX_CommonObject(const char* name=0);

    // Create a copy of an entire DX_CommonObject,
    // but with its own unique OID value
    DX_CommonObject* Clone();

    // All AddAttribute and AddPvtAttribute methods take ownership of
    // the pointers passed in to them. Do NOT delete the pointers after
    // a call to AddAttribute and AddPvtAttribute. The pointers will be
    // deleted when the container DX_CommonObject or DX_ListObject is deleted.
    //
    // NOTE: When using the following two methods for creating a
    // DX_STRING attribute
    //
    //   AddAttribute(const char* name, const int type, const void* value)
    //   and AddPvtAttribute(const char* name, const int type, const void* value)
    //
    // the value is defaulted to be of type const char* and not UNICHAR*
    // Misuse will lead to unexpected behavior.
    EreturnCodes AddAttribute(DX_CommonAttribute* newAttr);
    EreturnCodes AddAttribute(DX_ListObject* newAttr);
    EreturnCodes AddAttribute(DX_CommonObject* newAttr);
    EreturnCodes AddAttribute(const char* name, const int type, const void* value);
    EreturnCodes AddPvtAttribute(DX_CommonAttribute* newAttr);
    EreturnCodes AddPvtAttribute(DX_ListObject* newAttr);
    EreturnCodes AddPvtAttribute(DX_CommonObject* newAttr);
    EreturnCodes AddPvtAttribute(const char* name, const int type, const void* value);

    // DeleteAttribute and DeletePvtAttribute will remove the named attribute
    // from the container DX_CommonObject or DX_ListObject and delete the
    // named attribute's pointer
    EreturnCodes DeleteAttribute(const char* name);
    EreturnCodes DeletePvtAttribute(const char* name);

    // RemoveAttribute and RemovePvtAttribute will remove the named attribute
    // from the container DX_CommonObject or DX_ListObject but will not
    // delete the named attribute's pointer
    EreturnCodes RemoveAttribute(const char* name);
    EreturnCodes RemovePvtAttribute(const char* name);

    // Do NOT delete the pointer that is returned to you, it still is owned by
    // the container DX_CommonObject or DX_ListObject. You CAN use any of
    // the attribute class methods for the pointer.
    void* GetAttribute(const char* name);
    void* GetPvtAttribute(const char* name);

    void PrintContents();

    // The caller of Demarshal is responsible for object's memory allocation
    static DX_CommonObject* Demarshal(char* ObjOid, int type, int ContextIndex);
    EreturnCodes Marshal(int type, int ContextIndex);
    static EreturnCodes DeleteStorage(const char* oidVal, int type, int ContextIndex);

    // The caller of GetOID is responsible for deleting the memory of the
    // returned char*
    char* GetOID();
    EPriorityCode GetPriority();

protected:
    void* Find(const char* name);
    static EreturnCodes RestorePersistentObject(const char* oidVal);
    static EreturnCodes DeletePersistentObject(const char* oidVal, int type);

private:
    // Data Members
    DX_ListObject* PrivateList;
    DX_ListObject* PublicList;

    static void Delete(RWDlistCollectables* list);
    static int Copy(RWDlistCollectables* dest, const RWDlistCollectables *src);
    static EreturnCodes CopyPersistentObject(char* filename);
    static EreturnCodes CopyFile(char* filename, char* copyfilename);

    void saveGuts(RWFile& file) const;
    void saveGuts(RWvostream& stream) const;
    void restoreGuts(RWFile& file);
    void restoreGuts(RWvistream& stream);

    const int check() const;
    DX_Boolean updateListOids(DX_ListObject* Plist, DX_OID *Poid);
    DX_Boolean updateObjectOids(DX_CommonObject* Pobj, DX_OID *Poid);
    DX_Boolean ValidateNaming(const char* name);

#ifdef DX_DATABASE
    EreturnCodes InsertIntoTable(char* filename, int ContextIndex);
    static EreturnCodes RetrieveFromTable(char* filename, char* oid, int ContextIndex);
    EreturnCodes UpdateTable(char* filename, int ContextIndex);
#endif

    // used to calculate the number of bytes needed to store an object
    // using RWFile
    RWspace binaryStoreSize() const;
};

Detailed Description Paragraph Table (5) : 

class DX_OID : public DX_CommonAttribute
{
    RWDECLARE_COLLECTABLE(DX_OID);
    friend class DX_SysConfigObject;
    friend class DX_CommonObject;
    friend class DX_ListObject;

protected:
    DX_OID(char* value);
    DX_OID(const char* name, char* value);

public:
    // constructors
    DX_OID();
    virtual ~DX_OID();

    // Get a copy of type specific value. User should new a pointer for the
    // return value and then delete the pointer after use
    EreturnCodes GetAttributeValue(void* value);
    EreturnCodes GetAttributeValue(char* value);
    void PrintContents();

private:
    static unsigned long OIDCOUNT;
#ifndef HPUX
    static DX_Mutex lockOidCount;
#endif
    char *AttrValue;
    char *storeFileName;
    char *storeDirName;

    void saveGuts(RWFile& file) const;
    void saveGuts(RWvostream& stream) const;
    void restoreGuts(RWFile& file);
    void restoreGuts(RWvistream& stream);

    EreturnCodes SetAttributeValue(void* value);
    char* GetDirName();
    char* GetFileName();
    static DX_Boolean CheckOIDFormat(char* oidVal);
};

Detailed Description Paragraph Table (6) : 

class DX_CommonAttribute : public DX_CommonBase
{
public:
    virtual EreturnCodes GetAttributeValue(void* value) = 0;
    virtual EreturnCodes SetAttributeValue(void* value) = 0;
    virtual void PrintContents() = 0;
};

Detailed Description Paragraph Table (7) : 

attribute class : Integer Attribute 

/*
 * CLASS NAME     : DX_IntegerAttribute
 * INHERITED FROM : None
 * INHERITS       : DX_CommonAttribute
 * DESCRIPTION    : Provides storage for integer attributes
 */
class DX_IntegerAttribute : public DX_CommonAttribute
{
    RWDECLARE_COLLECTABLE(DX_IntegerAttribute);

public:
    // If name==NULL, the name is set to "NOT_SET".
    // If name is "" or contains a "." it will be set to "INVALID NAME"
    DX_IntegerAttribute(const char* name=0);
    DX_IntegerAttribute(const char* name, int value);
    virtual ~DX_IntegerAttribute();

    // The memory for the value argument is allocated and deallocated
    // by the caller. The library function GetAttributeValue just writes
    // to the value argument and the SetAttributeValue just reads the
    // argument to reset the value of the attribute
    EreturnCodes GetAttributeValue(void* value); // argument assumed to be int*
    EreturnCodes GetAttributeValue(int& value);
    EreturnCodes SetAttributeValue(void* value); // argument assumed to be int*
    EreturnCodes SetAttributeValue(int value);
    void PrintContents();

private:
    int AttrValue;
    void saveGuts(RWFile& file) const;
    void saveGuts(RWvostream& stream) const;
    void restoreGuts(RWFile& file);
    void restoreGuts(RWvistream& stream);
};

Detailed Description Paragraph Table (8) : 

attribute class : DX_StringAttribute 

/*
 * CLASS NAME     : DX_StringAttribute
 * INHERITED FROM : None
 * INHERITS       : DX_CommonAttribute
 * DESCRIPTION    : Provides storage of strings.
 */
class DX_StringAttribute : public DX_CommonAttribute
{
    RWDECLARE_COLLECTABLE(DX_StringAttribute);

public:
    // If name==NULL, the name is set to "NOT_SET".
    // If name is "" or contains a "." it will be set to "INVALID NAME"
    DX_StringAttribute(const char* name=0);
    DX_StringAttribute(const char* name, const char* value);
    DX_StringAttribute(const char* name, const UNICHAR* value);
    DX_StringAttribute(const char* name, const DX_String& value);
    virtual ~DX_StringAttribute();

    // Get a copy of char* value. The library will allocate the
    // appropriate storage, the user should delete the pointer after use.
    EreturnCodes GetAttributeValue(void* value); // assumes char** is passed

    // Get a copy of type specific value. The library will allocate the
    // appropriate storage, the user should delete the pointer after use
    EreturnCodes GetAttributeValue(char* &value);
    EreturnCodes GetAttributeValue(UNICHAR* &value);
    EreturnCodes GetAttributeValue(DX_String* &value);

    // The memory for the value argument is allocated and deallocated
    // by the caller. The library functions just read the value
    // argument to reset the value of the attribute.
    // NOTE: the method taking (void* value) assumes const char*
    EreturnCodes SetAttributeValue(void* value);
    EreturnCodes SetAttributeValue(const char* value);
    EreturnCodes SetAttributeValue(const UNICHAR* value);
    EreturnCodes SetAttributeValue(const DX_String& value);
    void PrintContents();

private:
    DX_String* AttrValue;
    void saveGuts(RWFile& file) const;
    void saveGuts(RWvostream& stream) const;
    void restoreGuts(RWFile& file);
    void restoreGuts(RWvistream& stream);
};

Detailed Description Paragraph Table (9) : 

class DX_String : public RWCollectable {
    RWDECLARE_COLLECTABLE(DX_String)
    friend class DX_StringAttribute;

public:
    DX_String();
    DX_String(const char* value);
    virtual ~DX_String();
    void PrintContents();

private:
    char* StringValue;
    int* len;
    int* refCount;
    DX_String& operator=(const DX_String& rhs);
    void saveGuts(RWFile& file) const;
    void saveGuts(RWvostream& stream) const;
    void restoreGuts(RWFile& file);
    void restoreGuts(RWvistream& stream);
};

Detailed Description Paragraph Table (10) : 
attribute class : DX_DateAttribute

/*
 * CLASS NAME     : DX_DateAttribute
 * INHERITED FROM : None
 * INHERITS       : DX_CommonAttribute
 * DESCRIPTION    : Provides storage for Date attributes
 *
 * NOTE: This class only stores the date related fields of the struct tm.
 *       The time related fields are initialized to 0, any data passed in
 *       the struct tm time related fields will be discarded.
 */
class DX_DateAttribute : public DX_CommonAttribute {
    RWDECLARE_COLLECTABLE(DX_DateAttribute);

public:
    // If name==NULL, the name is set to "NOT_SET".
    // If name is "" or contains a "." it will be set to "INVALID NAME"
    DX_DateAttribute(const char* name=0);
    DX_DateAttribute(const char* name, const struct tm value);
    virtual ~DX_DateAttribute();

    // The struct tm passed as an argument for GetAttributeValue
    // is to be allocated and deallocated by the caller.
    // The library function just copies the value into the structure.
    EreturnCodes GetAttributeValue(struct tm& value);
    EreturnCodes GetAttributeValue(void* value); // assumes struct tm* is passed

    // The memory for the value argument is allocated and deallocated
    // by the caller. The library functions just read the value
    // argument to reset the value of the attribute.
    EreturnCodes SetAttributeValue(void* value);
    EreturnCodes SetAttributeValue(const struct tm value);
    void PrintContents();

private:
    RWCollectableDate* AttrValue;
    void saveGuts(RWFile& file) const;
    void saveGuts(RWvostream& stream) const;
    void restoreGuts(RWFile& file);
    void restoreGuts(RWvistream& stream);
};

Detailed Description Paragraph Table (11) : 
attribute class : DX_TimeAttribute

/*
 * CLASS NAME     : DX_TimeAttribute
 * INHERITED FROM : None
 * INHERITS       : DX_CommonAttribute
 * DESCRIPTION    : Provides storage of time values in form of SSE.
 */
class DX_TimeAttribute : public DX_CommonAttribute {
    RWDECLARE_COLLECTABLE(DX_TimeAttribute);

public:
    // If name==NULL, the name is set to "NOT_SET".
    // If name is "" or contains a "." it will be set to "INVALID NAME"
    DX_TimeAttribute(const char* name=0);
    DX_TimeAttribute(const char* name, unsigned long value);
    virtual ~DX_TimeAttribute();

    // The memory for the value argument is allocated and deallocated
    // by the caller. The library function GetAttributeValue just writes
    // to the value argument and the SetAttributeValue just reads the
    // argument to reset the value of the attribute.

    // argument assumed to be unsigned long*
    EreturnCodes GetAttributeValue(void* value);
    EreturnCodes GetAttributeValue(unsigned long& secondSinceEpoch);
    // argument assumed to be unsigned long*
    EreturnCodes SetAttributeValue(void* value);
    EreturnCodes SetAttributeValue(const unsigned long value);
    void PrintContents();

private:
    RWCollectableTime* AttrValue;
    void saveGuts(RWFile& file) const;
    void saveGuts(RWvostream& stream) const;
    void restoreGuts(RWFile& file);
    void restoreGuts(RWvistream& stream);
};

Detailed Description Paragraph Table (12) : 
attribute class : DX_RealAttribute

/*
 * CLASS NAME     : DX_RealAttribute
 * INHERITED FROM : None
 * INHERITS       : DX_CommonAttribute
 * DESCRIPTION    : Provides storage of float types
 */
class DX_RealAttribute : public DX_CommonAttribute {
    RWDECLARE_COLLECTABLE(DX_RealAttribute);

public:
    //If name==NULL, the name is set to "NOT_SET".
    //If name is "" or contains a "." it will be set to
    //"INVALID NAME"
    DX_RealAttribute(const char* name=0);
    DX_RealAttribute(const char* name, float value);
    virtual ~DX_RealAttribute();

    //The memory for the value argument is allocated and
    //deallocated by the caller. The library function
    //GetAttributeValue just writes to the value argument
    //and the SetAttributeValue just reads the argument
    //to reset the value of the attribute
    EreturnCodes GetAttributeValue(void* value); //argument assumed to be float*
    EreturnCodes GetAttributeValue(float& value);
    EreturnCodes SetAttributeValue(void* value); //argument assumed to be float*
    EreturnCodes SetAttributeValue(float value);
    void PrintContents();

private:
    float AttrValue;
    void saveGuts(RWFile& file) const;
    void saveGuts(RWvostream& stream) const;
    void restoreGuts(RWFile& file);
    void restoreGuts(RWvistream& stream);
};

Detailed Description Paragraph Table (13) : 
attribute class : DX_BlobAttribute

/*
 * CLASS NAME     : DX_BlobAttribute
 * INHERITED FROM : None
 * INHERITS       : DX_CommonAttribute
 * DESCRIPTION    : Attribute storage class to store raw
 *                  binary stream
 *                  stored in a unsigned char*
 *                  Takes/returns DX_Blob struct
 */
class DX_BlobAttribute : public DX_CommonAttribute {
    RWDECLARE_COLLECTABLE(DX_BlobAttribute);

public:
    //If name==NULL, the name is set to "NOT_SET".
    //If name is "" or contains a "." it will be set to
    //"INVALID NAME"
    DX_BlobAttribute(const char* name=0);
    DX_BlobAttribute(const char* name, const DX_Blob& value);
    DX_BlobAttribute(const char* name, unsigned char* value, unsigned int size);
    virtual ~DX_BlobAttribute();

    //Get a copy of DX_Blob* value. The library will allocate the
    //appropriate storage, the user should delete the pointer after use.
    EreturnCodes GetAttributeValue(void* value); //assumes Blob**

    //Get a copy of type specific value. The library will allocate the
    //appropriate storage, the user should delete the pointer after use.
    EreturnCodes GetAttributeValue(DX_Blob* &value);
    EreturnCodes GetAttributeValue(unsigned char* &value);

    //The memory for the value argument is allocated and
    //deallocated by the caller. The library functions just read
    //the value argument to reset the value of the attribute.
    //
    //NOTE: the method taking (void* value) assumes
    //const DX_Blob*
    EreturnCodes SetAttributeValue(void* value);
    EreturnCodes SetAttributeValue(const DX_Blob& value);
    EreturnCodes SetAttributeValue(const unsigned char* value, unsigned int size);
    unsigned int GetBlobSize();
    void PrintContents();

private:
    DX_Blob* AttrValue;
    void saveGuts(RWFile& file) const;
    void saveGuts(RWvostream& stream) const;
    void restoreGuts(RWFile& file);
    void restoreGuts(RWvistream& stream);
};

Detailed Description Paragraph Table (14) : 

class : DX_Blob

The DX_Blob class is a reference counted container class used by the
DX_BlobAttribute to store the attribute's value. It provides the user a way to keep
down the overhead associated with having to copy the data.

class DX_Blob : public RWCollectable {
    RWDECLARE_COLLECTABLE(DX_Blob);
    friend class DX_BlobAttribute;
    friend class DX_MultiValueAttribute;

public:
    DX_Blob();
    DX_Blob(const unsigned char* value, unsigned int size);
    ~DX_Blob();
    DX_Blob& operator=(const DX_Blob& rhs);
    void PrintContents();
    unsigned int GetBlobSize();
    EreturnCodes GetValue(unsigned char** value);

private:
    int* refCount;
    unsigned int* BlobSize;
    unsigned char* BlobValue;
    void saveGuts(RWFile& file) const;
    void saveGuts(RWvostream& stream) const;
    void restoreGuts(RWFile& file);
    void restoreGuts(RWvistream& stream);
};

Detailed Description Paragraph Table (15) : 
attribute class : DX_VersionAttribute

/*
 * CLASS NAME     : DX_VersionAttribute
 * INHERITED FROM : None
 * INHERITS       : DX_CommonAttribute
 * DESCRIPTION    : Provides an integer value that
 *                : can be used to mark the version of
 *                : an object in process.
 */
class DX_VersionAttribute : public DX_CommonAttribute { RWDECLARE_COLLECTABLE
ABSTRACT

Internet search is one of the most important applications of the
Web. A search engine takes the user's keywords to retrieve and
rank those pages that contain the keywords. One shortcoming of
existing search techniques is that they do not give due
consideration to the micro-structures of a Web page. A Web page
is often populated with a number of small information units,
which we call micro information units (MIUs). Each unit focuses
on a specific topic and occupies a specific area of the page.
During the search, if all the keywords in the user query occur in a
single MIU of a page, the top ranking results returned by a search
engine are generally relevant and useful. However, if the query
words are scattered across different MIUs in a page, the pages
returned can be quite irrelevant (which causes low precision). The
reason for this is that although a page has information on
individual MIUs, it may not have information on their
intersections. In this paper we propose a technique to solve this
problem. At the off-line pre-processing stage, we segment each
page to identify the MIUs in the page, and index the keywords of
the page according to the MIUs in which they occur. In searching,
our retrieval and ranking algorithm utilizes this additional
information to return the most relevant pages. Experimental results
show that this method is able to significantly improve the search
precision.

Categories and Subject Descriptors 

H.3.3 [Information Storage and Retrieval]: Information Search
and Retrieval - Retrieval models, Search process.

General Terms 

Algorithms, Experimentation. 

Keywords 

Web Search, Micro Information Units, Web page segmentation. 

1. INTRODUCTION

One of the most important applications of the Web is search
using search engines, e.g., AltaVista, Google, Yahoo, etc. These
search systems allow the user to specify some keywords to
retrieve those Web pages that contain the keywords. A major
shortcoming of the current techniques is that they do not consider
different topic areas of a page. Typically, the contents of a Web
page encompass a number of related or even unrelated topics.
Each topic usually occupies a separate area in the page. We call
each topic area a micro information unit (or MIU in short). For
example, a bookstore Web page selling books may include other
diverse information like stock market quotations and weather
forecasting. A personal homepage may contain information on
different interests of its owner.

Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies
are not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. To copy
otherwise, or republish, to post on servers or to redistribute to lists,
requires prior specific permission and/or a fee.
CIKM'02, November 4-9, 2002, McLean, Virginia, USA.
Copyright 2002 ACM 1-58113-492-4/02/0011 ...$5.00.

A micro information unit (MIU) is a coherent topic area according
to its content, and it is usually also a visual block from the display
point of view. If the user's query terms (or keywords) occur in a
single MIU of a Web page, the pages returned by a search engine
are generally relevant and useful. However, if the keywords
are scattered across different MIUs, the precision of the
returned search results can be low. Although many search engines
are able to consider relative distances of keywords [4] (among
other factors, e.g., word frequency, authority and hub scores, etc.)
in a Web page in their ranking processes, they do not consider
whether these words occur in different MIUs or a single MIU.

Let us use an example to illustrate the problem. Suppose we
wish to find some free downloadable videos. We issue the search
query "free download video" to the search engine Google. Google
returns a large number of Web pages. However, most top ranking
pages do not offer any free downloadable videos. For example,
the first page returned by Google does contain the three keywords
"free", "download" and "video" (Figure 1). It is a site that sells
software for playing audio and video. It does not have any free
video for downloading. From Figure 1, we observe that "free",
"download" and "video" (circled in the figure) appear in different
MIUs or topic areas.

In this paper, we propose a technique to deal with the problem.
The key idea is to segment each Web page to identify different
micro information units or topic areas according to its HTML tags
and contents. In searching, if the keywords of a query occur in the
same MIU, the Web page will be given a higher ranking score.
Otherwise, it will be given a lower ranking score. In the proposed
technique, page segmentation and indexing according to MIUs in
a Web page is done in off-line pre-processing. We show that the
additional information on MIUs can be naturally integrated with
the inverted list indexing commonly used by Web search engines. In
on-line search, our retrieval and ranking algorithm makes use of
this MIU information to sort the relevant pages. Due to the seamless
integration of MIUs with inverted lists, the additional computation
required during searching is minimal.
Figure 1: The first page from Google for the query "free download video"


The proposed technique is intended to be used as an advanced
search option for a search engine (which we also call the base
search engine). That is, when the precision of the results returned
by the base search engine is low, we can employ the proposed
technique to re-rank the results.

To evaluate the proposed technique, we used Google as the base
search engine. When the precision of the results returned by
Google is low, we re-rank its top 200 pages. Experimental results
(including comparison with Google) show that our method is able
to improve the search precision dramatically, i.e., after re-ranking
the number of relevant pages at the top of the list increases
significantly.

2. RELATED WORK 

The key issue in Web search is how to efficiently retrieve relevant
Web pages with high precision for its top ranking results. The
main technique used in current search engines is keyword
matching. In recent years, a number of works were reported to
improve search by using additional information from Web pages.
[4] presents the Google search engine, which employs link
structures and anchor text in addition to the traditional factors
such as word occurrence and frequency to make relevance
judgments. [18] presents a similar link-based search method. [16]
examines the actual pages suggested by multiple search engines
and then displays the results according to the user's query. The
Clever search engine [6] incorporates several algorithms (i.e., the
HITS algorithm) that make use of hyperlink structure for discovering
high-quality information on the Web. [9] describes a Web
searching method where the input is the URL of a page. It only
uses the connectivity information to identify related pages. These
works are all different from ours, as they do not segment each
Web page to identify different MIUs and then use these MIUs to
aid the search.

[26] presents an algorithm to efficiently retrieve information
units, which are logical Web documents consisting of multiple
physical pages. Our micro information unit, in contrast, is a topic
area within a single physical Web page.



[5] uses the Document Object Model (or HTML tag tree) and
hyperlinks for topic distillation. It segments the HTML tag tree
for the purpose of computing authority and hub scores of the
intermediate subtrees in relation to other pages and links. This is
different from our work as we aim to find coherent topic areas of
the current page using both text contents and display properties.

Segmentation of text documents has been studied extensively in
information retrieval. Existing techniques roughly fall into two
categories: lexical cohesion methods and multi-source methods
[7]. The former identifies coherent blocks of text with similar
vocabulary [2, 10, 15, 23]. The latter combines lexical cohesion
with other indicators of topic shift, such as relative performance
of two statistical language models and cue words [1, 3]. In [11],
Hearst discussed the merits of imposing structure on full-length
text documents and reported good results of using local structures
for information retrieval.

There are also several works on passage retrieval methods,
although mostly applied in the area of text retrieval [14, 21]. [24]
proposes sub-document access: it allows the user to zoom in on
parts of the full text that are meaningful to his task. [S] utilizes the
reciprocal of the length of each passage as an estimate of its
relevance. [20] observes that query terms occurring in unrelated
contexts cause non-relevant documents to be returned.

Our Web search based on MIUs is different from the research
above. The nature of Web pages differs from that of a static text
document. Web contents can switch from one topic to a
completely different topic abruptly without requiring additional
textual cues to "bridge" the topic shift. Gradual topic shifts in text
documents are often indicated by certain textual cues (e.g., "next
consider...", "firstly... secondly..."). Such cues, employed by text
segmentation, are not applicable to Web pages. In our case we
consider the changes in visual cues (e.g., bold emphasis, sudden
increase in font size, change in font color) of Web pages. Visual
cues offer indications that a topic may have shifted within a Web
page. In our Web page segmentation, we make use of both the
contents and the presentation styles or visual features of the Web
page to segment the page.

3. PRE-PROCESSING WEB PAGES 

We now present the proposed technique. This section focuses on
pre-processing of each Web page, i.e., building an HTML tag tree
and segmenting the Web page into MIUs using the tag tree. The
next section describes our ranking algorithm. All the procedures
discussed in this section are done off-line. Since the proposed
technique is used as an advanced search method or option for a
base search engine, all the required information is assumed to be
stored at the base search engine site.

3.1 Building HTML Tag Trees 

Web pages are hypertext documents written in HTML that
consist of plain text, tags and links to image, audio, and video
files, etc. Like most search engines, our technique only uses plain
text and tags in search. Plain text consists of strings of characters
that are not themselves tags. It can have different appearances in
terms of color, font, size and style as specified by tags. Tags
(enclosed by a pair of angular brackets) define the display
properties and characteristics of a Web page. In general, most
Web documents are constituted of opening and closing pairs of
HTML tags (indicated by < > and </ > respectively). Each
corresponding tag pair can contain other pairs of tags, resulting
in nested blocks of HTML code.

Based on the nested structure of HTML code, a tag
tree can be built in a fairly straightforward manner for each Web
page using its HTML source. A node in the tag tree contains a tag
name, content text and its display attributes (color, font, size, etc.).
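
The tag tree construction can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses Python's standard html.parser module, and the names TagNode, TagTreeBuilder, build_tag_tree and the VOID_TAGS set are assumptions introduced here. Each node keeps the three items named above (tag name, content text, display attributes).

```python
from html.parser import HTMLParser

# Tags that never have a closing tag (assumed, partial list).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class TagNode:
    """One node of the tag tree: tag name, content text, display attributes."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.text = ""
        self.parent = parent
        self.children = []


class TagTreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = TagNode("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = TagNode(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend into the new block

    def handle_endtag(self, tag):
        # Close the nearest open ancestor with this tag name, if any.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.current.text = (self.current.text + " " + text).strip()


def build_tag_tree(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

Real Web pages are often malformed, so a production version would need more forgiving handling of unclosed tags than this sketch provides.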

3.2 Segmenting the Tag Tree into MIUs

We now segment the Web page into various MIUs using the tag
tree. Although the tag tree already gives us an initial segmentation
of the page, it is often too refined and is solely based on
presentation features of the page. We need to merge some nodes
in the tree to form coherent topic or information units. Our
segmentation technique is based on both content and display
similarities.

Merging of nodes is done in two steps: (1) merging each heading 
and its immediate content paragraph (note that a content 
paragraph may not have the <p> and </p> tags); (2) merging two 
adjacent text paragraphs. Below, we discuss these steps in turn. 

Step 1 - Merging each heading and its immediate content 
paragraph: In this step, we scan all the sibling nodes of a sub-tree 
from left to right to find all heading and paragraph pairs. This is 
performed in 2 sub-steps: 

(i) Identifying all potential heading and content paragraph
pairs: Let A and B be any two different leaf nodes of a sub-tree.
We use Len to denote the length (number of words) of the text
string stored in A or B. We use tagRank to denote the font
emphasis given to the text strings stored in A or B. The value of
tagRank is based on a priority. The highest value of tagRank is
assigned to header tags (e.g., <h1>, <h2> and so on), followed by
formatting tags (e.g., <b>, <strong>, <blink>) and enlarged font
sizes (<big>, <size..>). All the other tags are assigned the same
rank value that is lower than the three types above. In general, a
paragraph heading tends to be more prominent and distinct in
terms of font size or appearance as compared to its content
paragraph.

We use Neig(A, B) to denote the neighboring relation between A
and B, where node A is the left neighbor of B. The following
condition is used to determine whether A is a potential heading
for B (or B is A's immediate content paragraph). The term
$(A \cap B) \neq \emptyset$ means that at least one word (term) in
node A also occurs in its immediate text paragraph B.

$$(tagRank(A) \geq tagRank(B)) \wedge (Len(A) < Len(B)) \wedge Neig(A, B) \wedge ((A \cap B) \neq \emptyset) \qquad (1)$$

Here, we use the lengths of A and B, the font size attributes of A
and B, and their neighborhood relation to check whether A is
potentially a heading for B. Note that this is computed after
stop-word elimination and word stemming have been performed. We
use Porter's algorithm given in [22] for this purpose.
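
Condition (1) can be sketched as a small predicate. The numeric rank values, the dictionary layout of a node, and the function names tag_rank and is_potential_heading are assumptions made for illustration; the text only fixes the ordering of the tag groups. Terms are assumed to be already stemmed and free of stop words.

```python
HEADER_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
FORMAT_TAGS = {"b", "strong", "blink"}
ENLARGE_TAGS = {"big"}


def tag_rank(tag):
    """Font-emphasis rank; only the ordering matters (values are assumed)."""
    if tag in HEADER_TAGS:
        return 3
    if tag in FORMAT_TAGS:
        return 2
    if tag in ENLARGE_TAGS:
        return 1
    return 0  # all other tags share the lowest rank


def is_potential_heading(a, b, are_neighbors=True):
    """Condition (1): is node a a potential heading for node b?

    a and b are dicts with a 'tag' name and a list of 'terms';
    are_neighbors stands for Neig(A, B), with a to the left of b.
    """
    return (
        tag_rank(a["tag"]) >= tag_rank(b["tag"])
        and len(a["terms"]) < len(b["terms"])
        and are_neighbors
        and bool(set(a["terms"]) & set(b["terms"]))
    )
```

For instance, a node {"tag": "h2", "terms": ["weather"]} followed by a paragraph whose terms include "weather" satisfies the condition, while the reverse order does not.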

(ii) Further evaluation: After (i), we have identified all the
potential pairs. This sub-step further evaluates them using their
display properties. For each (A, B) pair, we try to find the next
pair (C, D) which also has a possible heading and content
paragraph relationship as computed in sub-step (i). We then
evaluate A, B, C and D using the display similarity, DisplaySim.
DisplaySim counts the number of identical features (display
properties) of any two nodes. We use the following condition:

$$\exists (C, D),\ Neig(C, D) \wedge (DisplaySim(A, C) \geq \delta) \wedge (DisplaySim(B, D) \geq \delta) \qquad (2)$$

where $DisplaySim(X, Y) = |X.features \cap Y.features|$ (which is the
size of the intersection). The set of features includes font, size,
color, tag name, and default. We set $\delta = 3$ (determined from
experimental observations), which means that if
$DisplaySim(A, C) \geq 3$ we consider A and C to have high display
similarity (this also applies to B and D).
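
A minimal sketch of the display similarity test follows. It assumes each node's display properties are held in a dictionary keyed by feature name (font, size, color, tag), and that two features are "identical" when both key and value match; the "default" feature is left out because its meaning is not spelled out in the text. The names display_sim and has_parallel_pair are hypothetical.

```python
DELTA = 3  # threshold from the text


def display_sim(x_features, y_features):
    """Number of display features on which two nodes agree."""
    return len(set(x_features.items()) & set(y_features.items()))


def has_parallel_pair(a, b, c, d, delta=DELTA):
    """Condition (2) for an already-found neighboring pair (C, D)."""
    return display_sim(a, c) >= delta and display_sim(b, d) >= delta
```

Here a and c would be the two candidate headings and b and d their candidate content paragraphs.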

This condition basically tries to see whether A and B have a
parallel pair (C, D). If so, we confirm the heading and content
paragraph relationship of A and B, and that of C and D. We
believe that the display property comparison is more meaningful
here since people are often able to segment a Web page correctly
even if they do not know the content of the page.

Figure 3: An inverted list with data structures for MIUs

If conditions (1) and (2) are both satisfied, we merge nodes A and
B, and at the same time C and D, i.e., we put the attributes of B
into A, and the attributes of D into C. Nodes B and D are deleted.

Step 2 - Merging two adjacent text paragraphs: Here, we wish to
join similar text paragraphs (some paragraphs may contain their
headings after step 1). Let X and Y be two text paragraph nodes
within the same sub-tree. We now compute their degree of
content similarity, ContentSim(X, Y). The inner product [17] is
employed for the purpose (m is the total number of terms or
keywords in $X \cup Y$). If term i exists in X, $x_i = 1$, otherwise
$x_i = 0$. If term i exists in Y, $x'_i = 1$, otherwise $x'_i = 0$.

$$ContentSim(X, Y) = \sum_{i=1}^{m} x_i x'_i \qquad (3)$$

If $ContentSim(X, Y) > \omega$, we say that nodes X and Y have a high
similarity. We can combine their contents, i.e., place the content
of Y into X. We set $\omega = 2$, determined by experiments. It reflects
the acceptable level of similarity among various nodes well.
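
Because the vectors are binary, the inner product in (3) simply counts the terms shared by the two paragraphs. The sketch below spells out the vectors to mirror the formula; the names content_sim, should_merge and OMEGA are introduced here for illustration.

```python
OMEGA = 2  # similarity threshold from the text


def content_sim(x_terms, y_terms):
    """Inner product of the binary term vectors of two paragraphs."""
    x_set, y_set = set(x_terms), set(y_terms)
    vocabulary = sorted(x_set | y_set)  # the m terms of X union Y
    x_vec = [1 if t in x_set else 0 for t in vocabulary]
    y_vec = [1 if t in y_set else 0 for t in vocabulary]
    return sum(xi * yi for xi, yi in zip(x_vec, y_vec))


def should_merge(x_terms, y_terms, omega=OMEGA):
    """Step 2 test: merge when the similarity exceeds the threshold."""
    return content_sim(x_terms, y_terms) > omega
```

With the threshold set to 2, two paragraphs must share at least three distinct terms before they are merged.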

The overall algorithm is given in Figure 2. maxDepth is the
maximum depth of the original tag tree. treeDepth is the depth of
the tree that is being worked on. Stree is the set of all sub-trees at
depth treeDepth. Each subtree only contains leaf nodes and no
sub-trees below.

for (treeDepth = maxDepth - 1; treeDepth >= 0; treeDepth--) do
    Stree = {subtree_i | subtree_i is a sub-tree at level treeDepth};
    while |Stree| > 0 do    /* |Stree| is the size of the set Stree */
        for each subtree_i ∈ Stree do
            for each Neig(A, B) do
                if condition (1) is satisfied then
                    if ∃ pair (C, D) and condition (2) is satisfied then
                        Merge nodes A and B;
            endfor
        endfor
        for each subtree_i ∈ Stree do
            Scan all the nodes and their sibling nodes;
            if ContentSim(X, Y) > ω then
                Merge nodes X and Y;
        endfor
        for each subtree_i ∈ Stree do
            if A has no sibling then
                move its content into its parent and delete it
        endfor
    endwhile
endfor

Figure 2: Merging nodes of a tag tree (segmenting a page)



4. THE RANKING ALGORITHM 

After obtaining the MIUs from each page through segmentation,
we index the Web pages in such a way that they can be retrieved
and ranked quickly. As in normal search, we also use inverted
lists to store the information of the Web pages. Thus, the search
technique we adopted is similar to those in a normal search engine
[4]. The main difference is that in our technique we need to index
and retrieve MIUs of each page. We simply add an extra data
structure to each inverted list node to indicate in which MIUs
each word appears. Figure 3 illustrates the inverted list indexing
with the data structures for MIUs.

Here $Item_i$ (i = 1, 2, ...) is a word, $D_{ij}$ is the j-th document that
$Item_i$ occurs in, and $MIU_{xy}$ is the y-th MIU in the x-th subtree of
the page $D_{ij}$. Each node includes three fields: ID (document ID),
seg (a pointer to all the MIUs of the page containing $Item_i$), and next
(a pointer to the next page). Note that the $D_{ij}$ in an inverted list
are stored in increasing order. For any user query
$Q = \{k_1, k_2, ..., k_n\}$, we will consider whether the words occur in
the same or neighboring MIUs
in the same subtree of the same page.
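
The extended inverted list can be sketched as below. This is an illustration under stated assumptions rather than the authors' data layout: a Python list stands in for the linked list (so the next pointer is implicit), seg is held as a set of (subtree, MIU) positions, and the names Posting and build_index as well as the input format are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Posting:
    """One inverted-list node: document ID plus the MIUs holding the term."""
    doc_id: int
    seg: set = field(default_factory=set)  # {(subtree x, MIU y), ...}


def build_index(pages):
    """Build term -> postings sorted by document ID.

    pages maps doc_id -> {subtree index: [term list of MIU 0, MIU 1, ...]}.
    """
    by_term = defaultdict(dict)
    for doc_id, subtrees in pages.items():
        for x, mius in subtrees.items():
            for y, terms in enumerate(mius):
                for term in terms:
                    posting = by_term[term].setdefault(doc_id, Posting(doc_id))
                    posting.seg.add((x, y))
    return {
        term: [by_doc[d] for d in sorted(by_doc)]
        for term, by_doc in by_term.items()
    }
```

Keeping each list sorted by document ID is what lets the ranking step walk several lists in parallel, always advancing the pointers at the smallest ID.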

A search engine typically considers many factors in its ranking
function, e.g., hyperlink information (such as authority score and
hub score), word count-weight, type-weight (title, anchor, URL,
font size, etc.), and type-prox-weight (how close multiple words
occur in every type) [4]. In our ranking algorithm, we only focus
on whether the query terms occur in a single MIU (or 2
neighboring MIUs within the same sub-tree) of a page. Since the
proposed technique is intended to be used as an advanced search
method for a base search engine, we utilize our MIU-based
information and also the ranking information from the base search
engine in our ranking process. We rely on the ranking
information from the base search engine so that our ranking
algorithm need not consider any factor other than our MIU-based
factor. However, since we do not have access to any
existing search engine program, only the ordering information of
the pages returned by the base search engine is employed in our
current ranking algorithm. If a search engine system is available,
all factors should be integrated in a more sophisticated manner.
Section 5 shows that even this simple approach is already able to
produce remarkably good results.

The proposed ranking method aims to re-rank the results returned
by the base search engine when the precision of its results is poor.
The number of pages to be re-ranked is specified by the user. In
our experiments, we re-rank the first 200 pages from Google.
Re-ranking is done on-line at query time. Pre-processing as discussed
in Section 3 is done off-line for all the pages at the search engine
site, as it is not possible to know what queries will be issued by
users, and it is too slow to do pre-processing of the top ranking
pages from the base search engine at query time.
Our ranking algorithm basically computes two scores for each
page, a primary score and a secondary score. The primary score
is the maximum number of query terms that occur in an MIU of the
page p. Let $seg_i$ be the set of terms contained in the i-th MIU of p
and queryTerms be the set of terms in the user query. The primary
score of page p (denoted by prScore(p)) is computed as follows:

$$prScore(p) = \max_i \left( |seg_i \cap queryTerms| \right) \qquad (4)$$

If the primary score of page p is less than the number of query
terms (i.e., not all query terms are covered), we compute the
secondary score, which takes into account the neighboring
MIU on the right of each MIU in the same sub-tree. Let $seg_{ji}$ be
the set of terms in the i-th MIU of the sub-tree j. The secondary
score of p (denoted by seScore(p)) is computed with:

$$seScore(p) = \max_j \left( \max_i \left( |(seg_{ji} \cup seg_{j(i+1)}) \cap queryTerms| \right) \right) \qquad (5)$$

The overall ranking algorithm is given in Figure 4.
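
The two scores can be sketched directly from their definitions. The representation of a page as a list of sub-trees, each a list of MIU term lists in left-to-right order, and the names pr_score and se_score are assumptions made for this illustration.

```python
def pr_score(subtrees, query_terms):
    """Primary score: most query terms found in any single MIU."""
    query = set(query_terms)
    return max(
        (len(set(miu) & query) for mius in subtrees for miu in mius),
        default=0,
    )


def se_score(subtrees, query_terms):
    """Secondary score: most query terms found in an MIU together with
    its right neighbor inside the same sub-tree."""
    query = set(query_terms)
    best = 0
    for mius in subtrees:
        for left, right in zip(mius, mius[1:]):
            best = max(best, len((set(left) | set(right)) & query))
    return best
```

For the query "free download video", a page whose first sub-tree holds the MIUs ["free", "software"] and ["download", "player"], and whose second sub-tree holds ["video", "codec"], gets a primary score of 1 and a secondary score of 2, since "video" sits in a different sub-tree.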

1  Create a set of variables pageSet_i, i = 1, ..., n;
2  pageSet_i = ∅;
3  for all p ∈ AllPages do prScore[p] = 0, seScore[p] = 0;
4  Retrieve the inverted lists of the query words k_1, k_2, ..., k_n;
5  Initialize pointer set L = {p_1, p_2, ..., p_n}, where each p_i points to
   the first node in the corresponding linked list;
6  while ∃ p_i ≠ null, p_i ∈ L do
7      md = min(p_i.id), p_i ∈ L;
8      Construct a pointer set LS from L: {p_j | p_j.id = md};
9      if |LS| = 1 then
10         prScore[p_j.id] = 1, seScore[p_j.id] = 1;
11     else scan all the MIUs to compute prScore and seScore
           of the page by checking if the query words occur
           in the same or neighboring MIUs.
12     for all p_j ∈ LS do p_j = p_j.next;
13 endwhile;
14 for each p ∈ AllPages do
15     if prScore(p) = n then pageSet_n = pageSet_n ∪ {p};
16     else if seScore(p) = n-i then pageSet_{n-i} = pageSet_{n-i} ∪ {p};
17 endfor
18 Rank pages in the order of pageSet_n, pageSet_{n-1}, and so on.
   For the pages in each pageSet_i, we follow their relative
   ranking in the results produced by the base search engine;

Figure 4: The ranking algorithm

In Figure 4, n is the number of query terms. AllPages is the set of
top-ranking pages from the base search engine.

Lines 1 and 2 create and initialize a set of set-variables to store
the resulting pages as the first level ranking (which will become
clear below). Line 3 initializes two arrays used to store the final
Web page scores. Given the user's query, lines 4 and 5 retrieve
the inverted lists of the query words and then create a pointer set.
Each pointer in the pointer set points to the first Web page of an
inverted list. From line 6 to line 13, we compute prScore and
seScore for all the Web pages. The loop ends when all the
pointers reach the end of the inverted lists, which means we have
already finished processing all the Web pages in the inverted lists.
In each loop, for all retrieved inverted lists, we first find the page
with the smallest document ID (md). After we process it (give the
page its prScore and seScore scores), we move the pointers that
point to the smallest document IDs to the next node to begin the
next loop. In a loop, if the page contains only one query word,
both prScore and seScore are given the score of 1. Otherwise, the
page contains at least two words in the user's query. Then we
need to check if they occur in the same or neighboring MIUs. In
this process, we update the maximal number of query words
contained in a single MIU and two neighboring MIUs. In line 15,
if prScore(p) = n, p should be one of the top ranking pages, stored
in pageSet_n (since we believe that a MIU in a Web page that
contains all the query terms is very likely to be relevant to the
user). If prScore(p) is less than n, we store p into pageSet_(n-i)
according to its seScore, where n-i indicates how many keywords
are found in two neighboring MIUs (line 16). Finally, we have a
two-level ranking (line 18). The first level ranks the sets of pages
in the order of pageSet_n, pageSet_(n-1) and so on. The second level
ranks all the pages in each pageSet_i according to their relative
ranking in the results produced by the base search engine.
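
The procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name rank_pages, the layout of inverted_index (each word maps to a list of (page id, set of MIU positions) pairs sorted by page id), and the use of consecutive integers for neighboring MIUs are all hypothetical. It also assumes the query words are distinct, and it leaves out pages that match no query word.

```python
from collections import defaultdict


def rank_pages(query_words, inverted_index, base_ranking):
    """Two-level ranking in the spirit of Figure 4 (illustrative sketch).

    inverted_index: word -> [(page_id, {miu_position, ...}), ...], sorted by
                    page_id, each posting with a non-empty MIU set (assumed).
    base_ranking:   page ids in the order returned by the base search engine.
    """
    n = len(query_words)
    lists = [inverted_index.get(w, []) for w in query_words]
    cursors = [0] * n                      # one pointer per inverted list
    pr_score = defaultdict(int)            # max query words in a single MIU
    se_score = defaultdict(int)            # max query words in two neighbors

    def live():
        # Indexes of the lists whose pointer has not reached the end.
        return [k for k in range(n) if cursors[k] < len(lists[k])]

    active = live()
    while active:
        # Smallest document id among the current list heads (md).
        md = min(lists[k][cursors[k]][0] for k in active)
        hits = [k for k in active if lists[k][cursors[k]][0] == md]
        if len(hits) == 1:
            # The page contains only one query word.
            pr_score[md] = se_score[md] = 1
        else:
            # Group the matched query words by the MIU they occur in.
            words_in_miu = defaultdict(set)
            for k in hits:
                for miu in lists[k][cursors[k]][1]:
                    words_in_miu[miu].add(k)
            pr_score[md] = max(len(ws) for ws in words_in_miu.values())
            se_score[md] = max(
                len(ws | words_in_miu.get(miu + 1, set()))
                for miu, ws in words_in_miu.items()
            )
        for k in hits:                     # advance pointers that sat on md
            cursors[k] += 1
        active = live()

    # First level: bucket pages into page sets; second level: base order.
    page_sets = defaultdict(list)
    for p in base_ranking:
        if pr_score.get(p, 0) == n:
            page_sets[n].append(p)
        elif se_score.get(p, 0) > 0:
            page_sets[se_score[p]].append(p)

    ranked = []
    for level in range(n, 0, -1):
        ranked.extend(page_sets[level])
    return ranked
```

Because pages are appended to each page set while iterating over base_ranking, the second-level order inside a set is simply the base search engine's order.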

The complexity of our algorithm is similar to the complexity of a
normal search engine. In a normal search engine, given a query,
its main task is to check if the query terms occur in the same page,
so the complexity is |query|+v on average (we ignore the other
cost in computing authority score, hub score, word count etc.).
Here |query| is the number of query words, and v is the average
length of all inverted lists. In our algorithm, we also need to check
in each page whether the query words occur in the same or
neighboring MIUs. Thus, it needs to traverse the MIU list of each
page. The complexity of our algorithm is then |query|*v*q. Here q is
the average number of MIUs in all the pages. Since q is normally
very small, little extra time is needed by our new search
technique.

5. EXPERIMENTAL RESULTS

This section evaluates the proposed technique. We first compare 
the precision results of our method with those from Google and 
then discuss its running efficiency. 

Evaluation of the ranking effectiveness is difficult in the context
of web search because of the difficult tasks in (i) choosing queries
and (ii) evaluating the relevance of search results. Our criteria for
choosing queries are: they should be from diverse areas and
unambiguous. By unambiguous, we mean that the intent of each
query is agreed upon by a panel of 3 judges. We used queries
from two independent sources: the entire collection of queries
(351-400) from TREC-7 [25], and 30 queries from MetaSpy of
MetaCrawler [19] (which allows users to view others' queries
being submitted to the system). For queries from MetaSpy, we
first collected a list of continuous queries and then removed those
queries that are ambiguous, i.e., our panel of judges could not
decide the intention of the user.

As for evaluating the relevance or correctness of the search
results, the web pages produced should satisfy the conditions
pre-defined by our judges or correspond to the standard narratives
provided by TREC [25]. For example, for TREC Query 354,
"Journalist Risks", the narrative states: "Any document
identifying an instance where a journalist has been killed,
arrested or taken hostage in the performance of his work is
relevant." Our judges evaluate the relevance of the search results
with such narratives to obtain a consensus on the search precision.
We chose Google as the basis for re-ranking (the base search
engine) because of its state-of-the-art search mechanism. In
general, Google performs very well as a general-purpose search
engine. However, there exist query phrases for which it fails to perform
satisfactorily. Our purpose is to provide advanced re-rankings for
queries for which Google's precision is low. For each query, we
re-rank the first 200 search results from Google, after crawling
and pre-processing the pages.

In general, information retrieval systems are evaluated using both
precision and recall measures. However, in the context of Web
search, the precision of the top-ranking results returned by a
search engine is more important since most people only see the
top 20-30 results [12, 13]. That is, even if a search engine has
high recall, if most of the relevant results are located below the
top 20-30 ranking results, there is little chance that the user will
see them. Thus, many researchers believe that high precision is
important even at the expense of recall [4]. In our experiments,
we use only the precision of the top 20 ranking results to evaluate
the performance of our system.
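
As a concrete reading of the measure, precision over the top k results can be computed as below. This is a small sketch assuming the standard definition (relevant pages among the first k, divided by k); the function name precision_at_k and the example data are hypothetical and are not taken from the experiments.

```python
def precision_at_k(ranked_pages, relevant_pages, k=20):
    """Fraction of the first k ranked pages that the judges marked relevant.

    Divides by k (the standard definition), so a result list shorter than k
    is penalized for the missing positions.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_pages[:k]
    hits = sum(1 for page in top_k if page in relevant_pages)
    return hits / k


# Hypothetical example: 5 of the first 20 pages are relevant.
ranking = list(range(40))
relevant = {0, 1, 2, 3, 4, 25}
print(precision_at_k(ranking, relevant, k=20))
```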

The precisions of the top 20 ranking results from Google and our
method (MIU) are compared in Table 1. The first column states
the source of queries. The second and third columns list the
average precisions of the top 20 results from Google and our
method respectively. The fourth column provides the
improvement percentage of MIU over Google on the average
precision. Tables 3 and 4 in the Appendix list all the search
queries and the corresponding precision for both data collections.
Table 1. Average precision comparison

Data Collections      Average Precision      Improvement
                      Google      MIU
TREC-7 (351-400)      0.50        0.59       18.00%
MetaSpy               0.54        0.63       16.67%



Figure 5: Average precision comparison per 5 returned pages
of MIU and Google for TREC-7 queries

Figure 6: Average precision comparison per 5 returned pages
of MIU and Google for MetaSpy queries

From Table 1, we observe that the average precision after our
re-ranking is substantially higher. The improvement in precision by
our system over that of Google is 18% for TREC-7 queries and
16.67% for MetaSpy queries. Figures 5 and 6 give the graphical
comparison of the average precision of every 5 returned pages for
both data collections. We observe that in general MIU is superior
to Google for any number of top-ranked pages (used in computing
precision).
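
The improvement percentages quoted above are relative gains over Google's average precision. The short check below recomputes them from the averages reported in Table 1; the function name relative_improvement is ours and is only for illustration.

```python
def relative_improvement(baseline, improved):
    """Relative gain of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


# Average precisions reported in Table 1 (Google, MIU).
print(round(relative_improvement(0.50, 0.59), 2))  # TREC-7
print(round(relative_improvement(0.54, 0.63), 2))  # MetaSpy
```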

Table 2 presents the won-lost-tied record of MIU against Google.
For the TREC-7 data collection, 58% of the queries
increased in precision, 20% remained unchanged and
22% decreased in precision after applying our MIU
method as compared to Google's ranking results. For the MetaSpy
query collection, the precisions of 67% of the queries increased,
the precisions of 3% of the queries remained unchanged and 30% of
the queries decreased. We observe that most instances of MIU
performing worse than Google occur when the precisions of
Google's results tend to be rather high. For example, for those
queries on which Google has better results, its average precision is 0.73
for the MetaSpy data collection. This precision value of Google
should be highly satisfactory for most users and does not require
additional MIU processing. That is why we say that our MIU
method can be seen as an advanced search option: it should be
used when Google's results are not satisfactory.
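
A won-lost-tied record of this kind can be derived from per-query precisions as sketched below. The function name, the tolerance used to decide ties, and the sample precisions are all hypothetical; the figures in Table 2 come from the authors' own per-query results, not from this example.

```python
def won_tied_lost(baseline_precisions, reranked_precisions, eps=1e-9):
    """Return (won %, tied %, lost %) of the re-ranked method per query."""
    if len(baseline_precisions) != len(reranked_precisions):
        raise ValueError("both lists must cover the same queries")
    won = tied = lost = 0
    for base, new in zip(baseline_precisions, reranked_precisions):
        if abs(new - base) <= eps:
            tied += 1          # precision unchanged
        elif new > base:
            won += 1           # precision increased after re-ranking
        else:
            lost += 1          # precision decreased after re-ranking
    total = len(baseline_precisions)
    if total == 0:
        return (0.0, 0.0, 0.0)
    return tuple(100.0 * count / total for count in (won, tied, lost))


# Hypothetical per-query precisions for four queries.
google = [0.50, 0.50, 0.70, 0.20]
miu = [0.60, 0.50, 0.60, 0.40]
print(won_tied_lost(google, miu))
```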

Table 2: Won-lost-tied record for TREC-7 and MetaSpy

Data Collections      Won      Tied     Lost
TREC-7 (351-400)      58%      20%      22%
MetaSpy               67%      3%       30%



We now briefly discuss the running efficiency of our system. We
use a single machine (Sun E450, 250MHz, with 500MB memory
and a single processor) for all our experiments. In pre-processing,
the major operations involved are crawling and indexing. It is
difficult to measure how long crawling took overall because of
complications like bandwidth limitations, crashed name servers,
congested networks and others. For indexing, the indexer runs at
roughly 4 pages per second. Our indexer does not run in parallel,
which affects the speed. All of this pre-processing largely
duplicates work already done by Google. It can be easily incorporated into
Google, which will improve the performance significantly. For
ranking, our ranking procedure handles roughly 50 pages per
second, or 2 to 10 seconds for each query (which is mostly
dominated by disk I/O). Improving the efficiency of crawling,
indexing and searching was not the main focus of this research.
With further optimization and more powerful machines, the
running speed can be improved significantly.
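
As a rough back-of-the-envelope check of these rates, the snippet below estimates the time needed for the 200 pages re-ranked per query. It assumes the quoted throughputs are steady and ignores crawling; the helper name seconds_needed is ours.

```python
def seconds_needed(num_pages, pages_per_second):
    """Time to process num_pages at a steady throughput."""
    return num_pages / pages_per_second


PAGES_PER_QUERY = 200          # first 200 Google results are re-ranked
print(seconds_needed(PAGES_PER_QUERY, 4))    # indexing at ~4 pages/s
print(seconds_needed(PAGES_PER_QUERY, 50))   # ranking at ~50 pages/s
```

At about 50 pages per second, ranking 200 pages takes about 4 seconds, which falls inside the reported range of 2 to 10 seconds per query.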

6. CONCLUSION 

In this paper, we presented a technique to improve the precision
of Web search. It is based on the idea of segmenting each web
page into different MIUs (topic areas) according to its contents
and HTML tags. In searching, only the terms in a single unit or at
most two neighboring units of a page are used to match the user's
query terms. This is different from existing techniques used by
current search engines, which typically employ all the terms in
the whole page to match the query terms. From the experiment
results shown in Section 5, we observe that the precision of the
ranking produced by our method is substantially higher
Table 3: Precision comparison using MetaSpy queries (the queries are ordered according to the won-tied-lost
record against Google)

Table 4: Precision comparison using TREC-7 queries (we reorder the queries in TREC-7 according to the won-tied-lost
record against Google)



Search Query 



Google 



MIU



Search Query 



Falkland petroleum exploration 



0.55 



0.70 



mercy killing 



journalist risks 



impact 



postmenopausal estrogen Britain 



human smuggling 



transportation tunnel disasters 



anorexia nervosa bulimia 



Food/drug laws 



0.25 



0.50 



0.05 
0.10 



0.60 



home schooling 



0.20 



automobile recalls



0.11 



0.60 
0.30 



0.85 



dismantling Europe's arsenal 



0.15 



0.20 



0.40 



euro opposition 



0.70 



0.70 



0.80 



0.35 



0.70 



0.75 



piracy 



in vitro fertilization 



0.15
1.00 



0.30 



0.25 



0.60 



0.70 



0.35 



0.15 



Native American casino 



encryption equipment export 



Nobel prize winners 



hydrogen energy



obesity medical treatment 



alternative medicine 



mental illness drugs 



space station moon 



0.30 



0.45 



rabies 



0.45 



0.60 



El Nino 



1.00 



0.70 



1.00 



robotics 



1.00 



0.85 



0.95 



0.45 



tourism 



0.80 



0.95 



0.00 



0.75 



0.90 



sick building syndrome 



Amazon rain forest



0.65 
0.70 



0.85 



0.60 
0.10 



1.00 



ocean remote sensing 



0.85 



0.20 
0.15 



0.50 



territorial waters dispute 



blood-alcohol fatalities 



0.55 



0.60
1.00 



mutual fund predictors 



1.00 



1.00 



0.45 
0.00 



0.60 



0.10 



0.40 



0.55 



0.80 



teaching disabled children



radioactive waste 



organic soil enhancement



illegal technology transfer 



orphan drugs



r&d drug prices 



0.65 



0.85 



0.45 



0.50 



drug legalization benefits 



0.45 



0.55 



clothing sweatshops 



0.45 



0.60 



Antarctica exploration



0.30 
0.40 



0.45 



commercial cyanide uses 



0.70 



cigar smoking



0.30 



hydrogen fuel automobiles
oceanographic vessels 



0.50 
0.90 



0.80 



0.70 



0.68 



0.35 



0.90 



0.79 



0.75 



0.53 



0.25 



0.85 
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